NEIGHBOURHOODS OF PHYLOGENETIC TREES: EXACT AND ASYMPTOTIC COUNTS

J. V. DE JONG, J. C. MCLEOD, AND M. STEEL

Abstract. A central theme in phylogenetics is the reconstruction and analysis of evolutionary trees from a
given set of data. To determine the optimal search methods for reconstructing trees, it is crucial to understand
the size and structure of the neighbourhoods of trees under tree rearrangement operations. The diameter and
size of the immediate neighbourhood of a tree have been well studied; however, little is known about the number of
trees at distance two, three or (more generally) k from a given tree. In this paper we provide a number of exact
and asymptotic results concerning these quantities, and identify some key aspects of tree shape that play a role
in determining these quantities. We obtain several new results for two of the main tree rearrangement operations
(Nearest Neighbour Interchange and Subtree Prune and Regraft), as well as for the Robinson-Foulds metric on
trees.
1 Introduction Phylogenetics is the study of evolutionary relationships between species.
These relationships are represented as phylogenetic trees, where the leaves correspond to extant 
species and the interior vertices correspond to ancestral species. A branch between two species 
in a tree indicates an evolutionary relationship between them [24, 13]. Central to phylogenetics 
is the problem of finding the optimal tree to fit a given data set, with the aim of determining the 
evolutionary history of the species being studied. However, the number of possible phylogenetic
trees grows rapidly with the number of leaves, so for data sets with a large number of leaves, 
the optimal tree is commonly found by searching the set of phylogenetic trees (tree space) via 
tree rearrangement operations [19, 26]. Tree rearrangement operations are also used to compare 
phylogenetic trees by looking at the distance (smallest number of tree rearrangement operations) 
between the trees. These could be trees obtained from the same data set using different search 
methods, or from different data sets on the same set of species [11, 10]. 

In order to effectively search tree space using tree rearrangement operations, it is crucial 
to understand the size and structure of the neighbourhood of (i.e. the set of trees obtained 
from) a phylogenetic tree under these operations. In this paper, we investigate the size of the 
neighbourhoods of trees arising from two commonly used tree rearrangement operations: Nearest 
Neighbour Interchange (NNI) and Subtree Prune and Regraft (SPR), as well as the Robinson- 
Foulds (RF) distance. Fig. 1.1 shows examples of the RF, NNI and SPR distances between trees. 
Expressions for the number of trees at distance one or two from a given tree under RF, distance 
one, two or three under NNI, and distance one under SPR and Tree Bisection and Reconnection 
(TBR) are already known [6, 22, 1, 17]. We provide new asymptotic expressions for the number 
of trees at distance k from a given tree under NNI and the RF distance. We also show that unlike 
NNI and RF, the number of trees at distance two from a given tree under SPR is dependent 
on the shape of the tree, and cannot be expressed solely in terms of the number of leaves and 
cherries of the tree. 

Fig. 1.1. Here, $T_1$ and $T_2$ are unrooted binary phylogenetic trees with seven leaves. They are (i) distance
two apart under the RF metric, (ii) distance two apart under the NNI metric, and (iii) distance one apart under
the SPR metric. Tree $T$ is obtained from $T_1$ or $T_2$ by contracting the two internal edges indicated by dotted lines.

The literature on the structure of tree neighbourhoods and tree space includes results regarding
the smallest number of NNI operations required to reach every tree in the set [14, 8], and the
characterisation of the splits appearing in trees within a certain distance of a given tree under
various distance measures including RF, NNI, SPR, and TBR [4]. Here, we provide asymptotic
results for the number of binary (fully resolved) trees that are a specified (small) distance from a
given binary tree under the RF metric. Recently, Allen and Steel [1] established the asymptotics
at the other end of the distribution. They showed that the proportion of binary trees that are
at nearly maximal distance from each other follows a Poisson distribution whose mean depends
on the proportion of leaves of the given tree that lie in a cherry (a path of length two where
both endpoints are leaves of the tree). Using the expressions for the sizes of the first and second
neighbourhoods, we provide an exact count for the number of pairs of binary phylogenetic trees
with n leaves that share a first neighbour under NNI and RF.

2 Definitions A graph $G$ is an ordered pair $(V(G), E(G))$ consisting of a vertex set $V(G)$
and an edge set $E(G)$. For any vertices $x, y \in V(G)$, $x$ and $y$ are adjacent if there is an edge
$e \in E(G)$ such that $e = \{x, y\}$. We call $x$ and $y$ the endpoints of $e$, and vertex $x$ and edge $e$
are said to be incident. Two distinct edges $e, f \in E(G)$ are adjacent if they have an endpoint
in common. Edges $e, f \in E(G)$ are parallel edges if they have the same endpoints. An edge
$f = \{x, x\}$ where $x \in V(G)$ is called a loop. A graph is simple if it has no loops or parallel edges.
All of the graphs referred to in this paper are simple.

The degree of a vertex $v \in V(G)$ is the number of vertices in $V(G)$ that are adjacent to $v$, and
is denoted $\deg(v)$. The Handshaking Lemma is a well-known result stating that for a graph $G$,
$\sum_{v \in V(G)} \deg(v) = 2|E(G)|$ (see [2] for more detail).

A graph $H$ is a subgraph of $G$ if $V(H) \subseteq V(G)$ and $E(H) \subseteq E(G)$. If $V(H) \subset V(G)$ or
$E(H) \subset E(G)$, then $H$ is a proper subgraph of $G$. A path $P$ in $G$ is a subgraph of $G$ which consists
of a sequence of distinct vertices $v_0, v_1, \ldots, v_k$ such that for all $i \in \{0, 1, \ldots, k-1\}$, $v_i$ and $v_{i+1}$ are
adjacent in $P$. We call this a path of length $k$. We may also refer to $P$ as a $(v_0 - v_k)$-path or an
$(e - f)$-path where $e = \{v_0, v_1\}$ and $f = \{v_{k-1}, v_k\}$. Note that paths are regarded as undirected,
so a $(v_0 - v_k)$-path in $G$ is considered identical to a $(v_k - v_0)$-path in $G$. A cycle is a path in
which the first and last vertices are the same (i.e. $v_0 = v_k$). The subgraph of a graph $G$ induced
by the vertex set $V \subseteq V(G)$ is the subgraph with vertex set $V$ and edge set $E \subseteq E(G)$, where $E$
consists of all the edges of $G$ that have both endpoints in $V$.

Two vertices $x, y \in V(G)$ are connected if there is an $(x - y)$-path in $G$. A graph $G$ is
connected if all pairs of vertices $x, y \in V(G)$ are connected. A component of $G$ is a maximal
connected subgraph of $G$.

The distance between two vertices $x, y \in V(G)$, denoted $d_G(x, y)$, is the length of the shortest
$(x - y)$-path in $G$. We define the distance between two vertex sets, $U = \{u_1, u_2, \ldots\}$ and
$V = \{v_1, v_2, \ldots\}$, to be $d_G(U, V)$, where

$$d_G(U, V) = \min\{d_G(u_i, v_j) : 1 \le i \le |U|,\ 1 \le j \le |V|\}.$$

The diameter $M$ of $G$ is given by

$$M = \max\{d_G(v_i, v_j) : v_i, v_j \in V(G)\}.$$

Two graphs $G$ and $G'$ are isomorphic if there is a bijection $\sigma : V(G) \to V(G')$ such that
for all pairs of vertices $x, y \in V(G)$, $x$ and $y$ are adjacent in $G$ if and only if $\sigma(x)$ and $\sigma(y)$ are
adjacent in $G'$.

2.1 Trees A tree $T$ is a connected graph containing no cycles. A forest is a graph whose
components are trees. A tree is rooted if it has a distinguished root vertex; otherwise, it is unrooted.
A leaf of a tree $T$ is a vertex of $T$ that has degree one. The leaf set $\mathcal{L}(T) \subseteq V(T)$ of a tree
$T$ is the set of all leaves in $T$. Vertices of $T$ that are not leaves are called internal vertices. If an
edge of $T$ is incident to a leaf, we call it a pendant edge of $T$; otherwise, it is an internal edge of $T$.

A binary tree is a tree in which all internal vertices have degree three. A binary phylogenetic
tree $T$ is a binary tree with a bijection $\phi : A \to \mathcal{L}(T)$ where $A$ is a set of $n$ labels (see
Fig. 1.1). Let $UB(n)$ be the set of all unrooted binary phylogenetic trees with $n$ leaves. For
trees $T_1, T_2 \in UB(n)$, we say that $T_1$ and $T_2$ are equal ($T_1 = T_2$) if they are isomorphic by a map
that preserves the leaf labelling. In this paper we will use the term 'tree' to refer to an unrooted
binary phylogenetic tree unless otherwise stated.

A cherry in a tree $T$ is a path of length two in which both endpoints are leaves of $T$. Let
$UB(n, c)$ be the set of all unrooted binary phylogenetic trees with $n$ leaves and $c$ cherries. For
example, in Fig. 1.1, $T_1 \in UB(7, 3)$ and $T_2 \in UB(7, 2)$, while $T$ is not a binary tree.

2.2 Subtrees A subtree of a graph $G$ is a subgraph of $G$ that is a tree. All connected
subgraphs of a tree $T$ are subtrees. The distance in $T$ between a subtree $T'$ of $T$ and a set of
vertices $V \subseteq V(T)$ is $d_T(V(T'), V)$, but we will simply write this as $d_T(T', V)$. Throughout
this paper we assume that all subtrees are proper subtrees, and have the property that if $T'$ is
a subtree of $T \in UB(n)$, then $\mathcal{L}(T') \subseteq \mathcal{L}(T)$. This ensures that $T'$ has at least one vertex of
degree two. If $T'$ has exactly one vertex of degree two then it is a pendant subtree; otherwise, it
is an internal subtree. Unless otherwise specified, all subtrees in this paper are maximal, pendant
subtrees. We may refer to the vertex $v$ of degree two in a pendant subtree $T'$ as the root of $T'$.

An edge $e$ in a tree $T$ is incident to a subtree $T'$ of $T$ if $e$ is incident to a vertex of degree
two in $T'$. The intersection $T_1 \cap T_2$ of two subtrees $T_1$ and $T_2$ of a tree $T$ is a subtree of $T$ in
which $V(T_1 \cap T_2) = V(T_1) \cap V(T_2)$ and $E(T_1 \cap T_2) = E(T_1) \cap E(T_2)$.

A tree $T$ is a caterpillar if the subtree induced by the internal vertices of $T$ is a path. A
balanced tree is a tree in which all leaves are equidistant from a single vertex or edge. Fig. 2.1
shows a caterpillar and two balanced trees, all of which are binary.




Fig. 2.1. Examples of (a) a caterpillar and (b), (c) balanced trees. 


We define $P_k(T)$ to be the number of distinct paths of length $k$ in $T$. An internal path $P$ of
a tree $T$ is a path in which all vertices of $P$ are internal vertices of $T$. We use $p_k(T)$ to denote
the number of distinct internal paths of length $k$ in $T$. We have the following lemma relating
the paths and internal paths of a tree. The proof is straightforward.

Lemma 2.1. Let $T \in UB(n)$ where $n \ge 4$. Then for $k \ge 3$, $P_k(T) = 4\,p_{k-2}(T)$.
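The relation in Lemma 2.1 between paths of length k and internal paths of length k - 2 can be checked numerically on small examples. The following is a minimal sketch, not part of the original argument: the helper names `caterpillar` and `count_paths` and the vertex encoding are illustrative assumptions, and only caterpillars are tested.

```python
from collections import deque


def caterpillar(n):
    """Adjacency map of an unrooted binary caterpillar with n >= 4 leaves.

    Internal vertices are ("v", i) and leaves are ("leaf", i); this encoding
    is an illustrative choice, not notation from the paper.
    """
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    spine = [("v", i) for i in range(1, n - 1)]       # n - 2 internal vertices
    leaves = [("leaf", i) for i in range(1, n + 1)]
    for a, b in zip(spine, spine[1:]):
        link(a, b)
    # Two leaves at each end of the spine, one leaf on every other spine vertex.
    link(spine[0], leaves[0])
    link(spine[0], leaves[1])
    link(spine[-1], leaves[-1])
    link(spine[-1], leaves[-2])
    for v, leaf in zip(spine[1:-1], leaves[2:-2]):
        link(v, leaf)
    return adj


def count_paths(adj, k, internal_only=False):
    """Count paths of length k >= 1; in a tree these are vertex pairs at distance k."""
    allowed = {v for v in adj if not internal_only or len(adj[v]) > 1}
    total = 0
    for src in allowed:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in allowed and w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(1 for d in dist.values() if d == k)
    return total // 2  # every path was found once from each endpoint


tree = caterpillar(8)
for k in range(3, 8):
    print(k, count_paths(tree, k), 4 * count_paths(tree, k - 2, internal_only=True))
```

Each printed row should show two equal counts; the identity reflects the fact that every internal endpoint of an internal path can be extended in exactly two ways.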

2.3 Edge and Vertex Operations Given a tree $T \in UB(n)$, if we delete an edge
$e \in E(T)$, we obtain the forest $T \setminus e$ where $V(T \setminus e) = V(T)$ and $E(T \setminus e) = E(T) - \{e\}$.
We contract an edge $e = \{x, y\}$ of $T$ to obtain a new non-binary tree, denoted $T/e$, by deleting $e$
and combining $x$ and $y$ into a single vertex $w$, such that all vertices adjacent to $x$ or $y$ in $T$ are
adjacent to $w$ in $T/e$. Fig. 1.1 shows the tree $T$ resulting from the contraction of two internal
edges of a tree $T_1$.

Let $T_1$ be a tree with edge $e = \{x, y\}$. We subdivide $e$ by deleting $e$ and inserting a vertex $u$
and edges $e_1 = \{x, u\}$ and $e_2 = \{u, y\}$ to obtain a non-binary tree $T_1'$. Given a non-binary tree
$T_2$, we suppress a vertex $u$ of degree two by deleting $u$ and its incident edges $e_1 = \{x, u\}$ and
$e_2 = \{u, y\}$, and inserting a single edge $e' = \{x, y\}$ to obtain a tree $T_2'$. Edge subdivision and
vertex suppression are inverse operations.

In this paper, when we perform any of the above operations, we assume that all edge and 
vertex labels in the original tree are preserved by the operation, except those explicitly deleted 
or inserted. 

2.4 Splits A partition of a set $X$ is a set of disjoint, non-empty subsets $\{X_1, X_2, \ldots, X_m\}$,
$m \ge 1$, such that $X = \bigcup_{i=1}^{m} X_i$. A partition of $X$ is a bipartition if $m = 2$. Consider a
tree $T \in UB(n)$. A bipartition $\{L_1, L_2\}$ of $\mathcal{L}(T)$ is a split of $T$ if there exists an edge $e \in E(T)$
such that $T \setminus e$ has components $T_1$ and $T_2$ with $\mathcal{L}(T_1) = L_1$ and $\mathcal{L}(T_2) = L_2$. We define
$S(T, e) = \{L_1, L_2\}$ as the split of $T$ associated with $e$. A split $S(T, e)$ is trivial if $e$ is a pendant
edge of $T$. We define $\Sigma(T) = \{S(T, e) : e \text{ is an internal edge of } T\}$ as the set of all non-trivial
splits of $T$. Two trees $T_1$ and $T_2$ are equal if and only if $\Sigma(T_1) = \Sigma(T_2)$ [7].

The following lemma gives two well known expressions, one for the number of internal edges
in a binary tree, and one for $|UB(n)|$ (see [24] for more detail).

Lemma 2.2.

(i) Let $T \in UB(n)$, $n \ge 3$. Then $T$ has $n - 3$ internal edges.

(ii) For all $n \in \mathbb{Z}^+$, $n \ge 3$, we have $|UB(n)| = \dfrac{(2n-4)!}{(n-2)!\,2^{n-2}}$.
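As a quick numerical illustration of Lemma 2.2(ii), the sketch below evaluates the standard count $(2n-4)!/((n-2)!\,2^{n-2})$ and compares it with the equivalent product of odd numbers $1 \cdot 3 \cdots (2n-5)$. The function names are hypothetical and chosen only for this example.

```python
from math import factorial


def num_unrooted_binary_trees(n):
    """|UB(n)| via the closed form (2n-4)! / ((n-2)! 2^(n-2)), for n >= 3."""
    if n < 3:
        raise ValueError("the formula is stated for n >= 3")
    return factorial(2 * n - 4) // (factorial(n - 2) * 2 ** (n - 2))


def odd_product_form(n):
    """The same count written as 1 * 3 * 5 * ... * (2n - 5)."""
    result = 1
    for odd in range(1, 2 * n - 4, 2):
        result *= odd
    return result


for n in range(3, 9):
    print(n, num_unrooted_binary_trees(n), odd_product_form(n))
```

The rapid growth of these values is what motivates heuristic searches of tree space rather than exhaustive enumeration.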

2.5 Neighbourhoods In this paper we consider three metrics: Robinson-Foulds (RF),
Nearest Neighbour Interchange (NNI), and Subtree Prune and Regraft (SPR), defined in Section 3,
Section 4, and Section 6 respectively.

Given one of these three metrics $\delta_\theta$, $\theta \in \{RF, NNI, SPR\}$, on $UB(n)$, the $k$th neighbourhood
of a tree $T$, denoted $N^k_\theta(T)$, is given by

$$N^k_\theta(T) = \{T' \in UB(n) : \delta_\theta(T, T') = k\}.$$

A tree $T' \in N^k_\theta(T)$ is called a $k$th neighbour of $T$.

3 Robinson-Foulds Metric Given two trees $T_1, T_2 \in UB(n)$, the Robinson-Foulds (RF)
distance between $T_1$ and $T_2$ is defined by

$$\delta_{RF}(T_1, T_2) = \tfrac{1}{2}|\Sigma(T_1) - \Sigma(T_2)| + \tfrac{1}{2}|\Sigma(T_2) - \Sigma(T_1)|.$$

Alternatively, the RF distance between $T_1$ and $T_2$ can be seen as the minimum $m$ for which there
exist $E_1 \subseteq E(T_1)$ and $E_2 \subseteq E(T_2)$ where $|E_1| = |E_2| = m$, such that $T_1/E_1 = T_2/E_2$. This is
illustrated in Fig. 1.1, where $\delta_{RF}(T_1, T_2) = 2$.
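The split-based definition of the RF distance translates directly into set operations. The sketch below assumes each tree is supplied as a collection of its non-trivial splits, with every split given by one of its two sides; the function names and the normalisation convention (keep the side that avoids the smallest leaf) are illustrative choices, not taken from the paper.

```python
def normalise(side, leaves):
    """Represent a bipartition by the side that does not contain the smallest leaf."""
    side = frozenset(side)
    if min(leaves) in side:
        return frozenset(leaves) - side
    return side


def rf_distance(splits1, splits2, leaves):
    """Half the size of the symmetric difference of the two non-trivial split sets."""
    s1 = {normalise(s, leaves) for s in splits1}
    s2 = {normalise(s, leaves) for s in splits2}
    return (len(s1 - s2) + len(s2 - s1)) / 2


leaves = {1, 2, 3, 4, 5}
t1 = [{1, 2}, {4, 5}]        # cherries {1,2} and {4,5}
t2 = [{1, 3}, {3, 1, 2}]     # cherry {1,3}; the split {1,2,3}|{4,5} given by its other side
t3 = [{1, 3}, {2, 4}]        # shares no non-trivial split with t1
print(rf_distance(t1, t2, leaves), rf_distance(t1, t3, leaves))
```

Normalising each split before comparison matters because the same bipartition can be written from either side, as the second split of `t2` illustrates.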

The $k$th RF neighbourhood of a tree $T \in UB(n)$ is the set of trees in $UB(n)$ that are exactly
RF distance $k$ from $T$. In terms of edge contraction, this neighbourhood consists of all trees
$T' \in UB(n)$ such that the minimum $j$ for which we could contract $j$ edges of $T$ and $j$ edges of
$T'$ and obtain the same (non-binary) tree is $k$.

The RF distance was originally introduced by Bourque [3] and was generalised by Robinson
and Foulds [23]. Unlike the metrics induced by NNI and SPR that we will see in later sections,
the RF distance between two trees is computationally easy to calculate. (Day [12] provided a
linear-time algorithm.) In this section we consider the first, second and $k$th RF neighbourhoods
of an unrooted binary phylogenetic tree. Let $T \in UB(n, c)$ where $n \ge 3$ (recall that $UB(n, c)$ is
the set of unrooted binary phylogenetic trees with $n$ leaves and $c$ cherries). Then

1. $|N^1_{RF}(T)| = 2(n - 3)$, and

2. $|N^2_{RF}(T)| = 2n^2 - 8n + 6c - 12$.

This expression for the size of the first RF neighbourhood is commonly known, and as we will
see later, is the same as the size of the first NNI neighbourhood, found by Robinson [22]. The
expression for the size of the second RF neighbourhood appears in Section 4.2 of [6].

Much of the literature on the RF distance has focused on calculating the RF distance between
two trees, and on the distribution of the distances between trees. Bryant and Steel [6]
gave a polynomial-time algorithm for finding the distribution of trees around a given tree $T$,
and showed that this distribution can be approximated by a Poisson distribution determined
by the proportion of leaves of $T$ that are in cherries. Hendy et al. [15] used generating function
techniques to calculate the probability that two trees, selected uniformly at random from $UB(n)$,
are RF distance $m$ from each other.

While the sizes of the first and second RF neighbourhoods are known, the sizes of higher
neighbourhoods are not known in general. Although $|N^2_{RF}(T)|$ depends on the shape of $T$ (via $c$), for
$k = 1, 2$ we can write $|N^k_{RF}(T)| = \frac{2^k n^k}{k!}\left(1 + O(n^{-1})\right)$. Our main result in this section (Theorem 3.1)
provides a generalisation of this asymptotic equality to all values of $k \ge 1$.


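Theorem 3.1. Let $T \in UB(n)$ ($n \ge 4$). For each fixed $k \in \mathbb{Z}^+$,

$$|N^k_{RF}(T)| = \frac{2^k n^k}{k!}\left(1 + c_{T,k}\, n^{-1} + O(n^{-2})\right),$$

where

$$-\frac{5k^2 + 7k}{4} \le c_{T,k} \le 4k^2 - 7k. \qquad (3.1)$$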
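For k = 1 and k = 2 the exact neighbourhood sizes quoted earlier in this section give a direct sanity check on Theorem 3.1. The sketch below assumes the bounds in (3.1) read $-(5k^2+7k)/4 \le c_{T,k} \le 4k^2-7k$, as they follow from the proof; all function names are hypothetical. The "effective coefficient" it computes still contains the lower-order $O(n^{-1})$ contribution, so it is only expected to fall inside the interval, not to equal $c_{T,k}$.

```python
from fractions import Fraction
from math import factorial


def c_bounds(k):
    """Assumed reading of (3.1): lower and upper bounds on c_{T,k}."""
    return Fraction(-(5 * k * k + 7 * k), 4), Fraction(4 * k * k - 7 * k)


def first_rf_size(n):
    return 2 * (n - 3)


def second_rf_size(n, c):
    return 2 * n * n - 8 * n + 6 * c - 12


def effective_coefficient(size, n, k):
    """Solve size = (2^k n^k / k!) (1 + x / n) for x."""
    leading = Fraction(2 ** k * n ** k, factorial(k))
    return n * (Fraction(size) / leading - 1)


n, cherries = 20, 5
print(c_bounds(1), effective_coefficient(first_rf_size(n), n, 1))
print(c_bounds(2), effective_coefficient(second_rf_size(n, cherries), n, 2))
```

For k = 1 the two bounds coincide, which matches the fact that the first neighbourhood size does not depend on tree shape.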


The proof of Theorem 3.1 comprises two steps. First, we determine the number of binary
phylogenetic trees whose splits differ from $\Sigma(T)$ by exactly the $k$ splits associated with a given
subset of $k$ internal edges of $T$. We then determine the number of subsets of $k$ internal edges
in $T$. We consider three cases:

1. The $k$ edges are pairwise non-adjacent.

2. Exactly two of the $k$ edges are adjacent.

3. More than two of the $k$ edges are adjacent.

The term of order $n^k$ in Equation (3.1) is completely determined by Case 1 above, while the
term of order $n^{k-1}$ is determined by Cases 1 and 2. We show that all other possibilities for the
$k$ edges (covered by Case 3) only contribute to terms of order $n^{k-2}$ or lower.

Neighbours with Different Splits over k Given Edges 

Let $\Sigma_k$ be a given set of $k$ splits of $T \in UB(n)$ ($k \ge 1$). We define

$$\Lambda(T, \Sigma_k) = |\{T' \in UB(n) : (\Sigma(T) - \Sigma_k) \subseteq \Sigma(T')\}|$$

as the number of trees containing the splits $\Sigma(T) - \Sigma_k$, and

$$\mathring{\Lambda}(T, \Sigma_k) = |\{T' \in UB(n) : \Sigma(T) \cap \Sigma(T') = \Sigma(T) - \Sigma_k\}|$$

as the number of trees containing the splits $\Sigma(T) - \Sigma_k$, and no other splits of $T$.

In Lemma 3.2 we obtain an expression for $\Lambda(T, \Sigma_k)$ and show that once $T$ and $\Sigma_k$ are specified,
$\mathring{\Lambda}(T, \Sigma_k)$ is independent of $n$.
Lemma 3.2. Let $T \in UB(n)$ ($n \ge 4$), let $e_1, \ldots, e_k$ ($1 \le k \le n - 3$) be distinct internal
edges of $T$, and let $\Sigma_k$ be the set of $k$ splits of $T$ associated with these edges. Define $F$ to be the
subgraph of $T$ consisting of the edges $e_1, \ldots, e_k$. Then

(i)
$$\Lambda(T, \Sigma_k) = \prod_{m=1}^{k} \left(\frac{(2m+2)!}{(m+1)!\,2^{m+1}}\right)^{c_m},$$

where $c_m$ is the number of components of $F$ with exactly $m$ edges.

(ii) Let $T' \in UB(s)$ ($s \ge k + 3$), and let $F'$ be the subgraph of $T'$ consisting of distinct
internal edges $e_1', \ldots, e_k'$ of $T'$. Let $\Sigma_k'$ be the set of $k$ splits of $T'$ associated with these
edges. If $F'$ is isomorphic to $F$, then

$$\mathring{\Lambda}(T', \Sigma_k') = \mathring{\Lambda}(T, \Sigma_k).$$

In other words, the number of trees containing the splits $\Sigma(T) - \Sigma_k$ and no other splits
of $T$ is not dependent on $n$.

Proof.

(i) Let $C_1, \ldots, C_\ell$ be the components of $F$. Given a component $C_i$ with $m$ edges, let $A_i$ be
the subtree of $T$ consisting of the corresponding $m$ edges of $C_i$ in $T$ and their adjacent
edges (note that $A_i$ may be an internal subtree). Then $A_i$ has $m + 3$ leaves. We want
to find $\Lambda(A_i, \Sigma_i)$, where $\Sigma_i$ is the set of splits associated with the internal edges of $A_i$.
(Note that this is the same as the number of trees that are at most RF distance $m$
from $A_i$.) Clearly $\Lambda(A_i, \Sigma_i) = |UB(|\mathcal{L}(A_i)|)| = |UB(m + 3)|$, as it is the number of trees
in $UB(m + 3)$ that have at least zero splits in common with $A_i$. By Lemma 2.2,

$$|UB(m + 3)| = \frac{(2(m+3) - 4)!}{((m+3) - 2)!\,2^{(m+3)-2}} = \frac{(2m+2)!}{(m+1)!\,2^{m+1}}.$$

We can apply this principle to each component of $F$. The results for each component
are independent of those for the other components of $F$. Therefore, we can take the
product to obtain

$$\Lambda(T, \Sigma_k) = \prod_{i=1}^{\ell} \Lambda(A_i, \Sigma_i) = \prod_{m=1}^{k} \left(\frac{(2m+2)!}{(m+1)!\,2^{m+1}}\right)^{c_m}.$$

(ii) This is similar to (i), except that we now restrict our attention to $\mathring{\Lambda}(T, \Sigma_k)$, that is,
those trees in $\Lambda(T, \Sigma_k)$ that do not contain any of the splits in $\Sigma_k$. Similarly to (i), we
have

$$\mathring{\Lambda}(T, \Sigma_k) = \prod_{i=1}^{\ell} \mathring{\Lambda}(A_i, \Sigma_i). \qquad (3.2)$$

Note that for each subtree $A_i$, some of the trees counted by $\Lambda(A_i, \Sigma_i)$ have splits in common
with $A_i$, and hence are not counted by $\mathring{\Lambda}(A_i, \Sigma_i)$. Clearly, $\mathring{\Lambda}(A_i, \Sigma_i)$ is dependent
on the shape and size of $A_i$, which itself depends on the choice of the $k$ edges of $T$ and
not on the shape or number of leaves of $T$. Therefore, since $F' \cong F$, we have

$$\mathring{\Lambda}(T', \Sigma_k') = \prod_{i=1}^{\ell} \mathring{\Lambda}(A_i, \Sigma_i) = \mathring{\Lambda}(T, \Sigma_k).$$

□

We now consider expressions for $\Lambda(T, \Sigma_k')$ and $\mathring{\Lambda}(T, \Sigma_k')$ where $\Sigma_k'$ is the set of splits associated
with $k$ distinct, pairwise non-adjacent internal edges of $T$.

Lemma 3.3. Let $T \in UB(n)$ ($n \ge 4$) and let $\Sigma_k'$ ($1 \le k \le n - 3$) be the set of splits associated
with distinct, pairwise non-adjacent internal edges $e_1, \ldots, e_k$ of $T$. Then

(i) $\Lambda(T, \Sigma_k') = 3^k$ and

(ii) $\mathring{\Lambda}(T, \Sigma_k') = 2^k$.

Proof.

(i) This follows directly from Lemma 3.2.

(ii) For some $i$, $1 \le i \le k$, let $A_i$ be the subtree of $T$ consisting of edge $e_i$ and its adjacent
edges in $T$ (note that $A_i$ may be an internal subtree). Then $A_i$ has four leaves. Note that
$\Lambda(A_i, S(A_i, e_i)) = |UB(4)| = 3$. However, one of these three trees is $A_i$. The remaining
two trees each have a single internal edge, and the split associated with this edge is not
$S(A_i, e_i)$. Hence $\mathring{\Lambda}(A_i, S(A_i, e_i)) = 2$, and by Equation (3.2), $\mathring{\Lambda}(T, \Sigma_k') = 2^k$.

□
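The product formula of Lemma 3.2(i) depends only on the sizes of the components formed by the chosen edges. A minimal sketch, assuming the chosen edge set is summarised as a list of component sizes (the names `ub` and `lambda_count` are illustrative, not from the paper):

```python
from math import factorial, prod


def ub(n):
    """Number of unrooted binary phylogenetic trees on n >= 3 leaves."""
    return factorial(2 * n - 4) // (factorial(n - 2) * 2 ** (n - 2))


def lambda_count(component_sizes):
    """Number of trees that keep every split of T outside the chosen edges.

    component_sizes lists, for each component of the subgraph formed by the
    chosen internal edges, how many edges it has; a component with m edges
    contributes a factor |UB(m + 3)|.
    """
    return prod(ub(m + 3) for m in component_sizes)


print(lambda_count([1, 1, 1]))   # three pairwise non-adjacent edges
print(lambda_count([2]))         # one adjacent pair of edges
print(lambda_count([2, 1]))      # an adjacent pair plus one separate edge
```

With every component of size one the product reduces to $3^k$, which is the first part of Lemma 3.3.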

Now that we have investigated the case where the $k$ internal edges are pairwise non-adjacent,
we consider an adjacent pair of internal edges.

Lemma 3.4. Let $T \in UB(n)$ ($n \ge 5$), and let $\Sigma_2$ be the set of splits associated with two
adjacent internal edges of $T$. Then $\mathring{\Lambda}(T, \Sigma_2) = 10$.

Proof. Let $\Sigma_2 = \{S(T, e_1), S(T, e_2)\}$, where $e_1$ and $e_2$ are adjacent internal edges of $T$. The
set of trees counted by $\Lambda(T, \Sigma_2)$ includes trees which have one or more of the splits in $\Sigma_2$ in
common with $T$. Therefore, to obtain $\mathring{\Lambda}(T, \Sigma_2)$, we subtract from $\Lambda(T, \Sigma_2)$ the number of trees
in $UB(n)$ that have exactly one split different to $T$ associated with either $e_1$ or $e_2$, or the same
splits as $T$. Hence

$$\mathring{\Lambda}(T, \Sigma_2) = \Lambda(T, \Sigma_2) - \mathring{\Lambda}(T, S(T, e_1)) - \mathring{\Lambda}(T, S(T, e_2)) - 1.$$

Therefore, by Lemma 3.2, if $e_1$ and $e_2$ are adjacent, then $\mathring{\Lambda}(T, \Sigma_2) = 15 - 5 = 10$.

□
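The count in Lemma 3.4 can be confirmed by brute force on five leaves, where the two internal edges of any tree are adjacent. The sketch below relies on the observation that a binary tree on five leaves is a caterpillar determined by its two cherries, so each tree is encoded as a pair of cherries (equivalently, its two non-trivial splits). The encoding and function names are assumptions made for this example.

```python
from collections import Counter

LEAVES = (1, 2, 3, 4, 5)


def five_leaf_trees():
    """All unrooted binary trees on five leaves, each as a set of two cherries."""
    trees = set()
    for middle in LEAVES:
        rest = [x for x in LEAVES if x != middle]
        first = rest[0]
        for partner in rest[1:]:
            cherry1 = frozenset({first, partner})
            cherry2 = frozenset(rest) - cherry1
            trees.add(frozenset({cherry1, cherry2}))
    return trees


def shared_split_profile(tree, trees):
    """How many trees share 0, 1 or 2 non-trivial splits with the given tree."""
    return Counter(len(tree & other) for other in trees)


trees = five_leaf_trees()
t = frozenset({frozenset({1, 2}), frozenset({4, 5})})
profile = shared_split_profile(t, trees)
print(len(trees), dict(profile))
```

Trees sharing one split are at RF distance one and trees sharing none are at RF distance two, so the same profile also reproduces the first and second neighbourhood sizes for n = 5 and c = 2.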

The Number of Subsets of k Internal Edges 


Lemma 3.5. Let $T \in UB(n)$ ($n \ge 4$). Then

(i) The number of sets of $k$ distinct, pairwise non-adjacent internal edges $e_1, \ldots, e_k$ in $T$
($1 \le k \le n - 3$), denoted $A_{T,k}$, satisfies

$$\frac{1}{k!} n^k - \frac{k(5k+1)}{2\,k!} n^{k-1} + O(n^{k-2}) \le A_{T,k} \le \frac{1}{k!} n^k - \frac{k(k+2)}{k!} n^{k-1} + O(n^{k-2}).$$

(ii) The number of sets of $k$ distinct internal edges $e_1, \ldots, e_k$ in $T$ ($2 \le k \le n - 3$) where
exactly two edges are adjacent, denoted $B_{T,k}$, satisfies

$$\frac{1}{2(k-2)!} n^{k-1} + O(n^{k-2}) \le B_{T,k} \le \frac{2}{(k-2)!} n^{k-1} + O(n^{k-2}).$$

(iii) The number of sets of $k$ distinct internal edges $e_1, \ldots, e_k$ ($3 \le k \le n - 3$) in $T$ where
more than two edges are adjacent is $O(n^{k-2})$.
Proof.

(i) We calculate the bounds by considering the best and worst case scenarios for the choice
of each edge. There are $n - 3$ choices for the first edge $e_1$. There are at most $(n - 3) - 2$
choices for $e_2$ (this can occur when $e_1$ has exactly one adjacent internal edge in $T$).
There are then at most $(n - 3) - 4$ choices for $e_3$ (this can occur when $e_1$ and $e_2$ each
have exactly one adjacent internal edge in $T$), and so on. Therefore

$$A_{T,k} \le \frac{1}{k!}(n-3)(n-3-2)(n-3-2(2)) \cdots (n-3-2(k-1)) = \frac{1}{k!}\prod_{i=0}^{k-1}(n-3-2i) = \frac{1}{k!} n^k - \frac{k(k+2)}{k!} n^{k-1} + O(n^{k-2}).$$

On the other hand, there are at least $(n - 3) - 5$ choices for $e_2$ (this can occur when $e_1$
has four adjacent internal edges in $T$). There are then at least $(n - 3) - 10$ choices for
$e_3$ (this can occur when $e_1$ and $e_2$ each have four adjacent internal edges in $T$), and so
on. Therefore

$$A_{T,k} \ge \frac{1}{k!}(n-3)(n-3-5)(n-3-5(2)) \cdots (n-3-5(k-1)).$$

(ii) We will prove this in the same way as (i), assuming without loss of generality that $e_1$
and $e_2$ are the adjacent pair of edges. There are $n - 3$ choices for $e_1$. There are at most
four choices for $e_2$ (this can occur if $e_1$ has four adjacent internal edges in $T$). For $e_3$,
there are at most $(n - 3) - 3$ choices (this can occur if $e_1$ and $e_2$ each have two adjacent
pendant edges in $T$). The remaining edges follow in the same way as in (i). Therefore

$$B_{T,k} \le \frac{4}{2(k-2)!}(n-3)(n-6)(n-6-2(1)) \cdots (n-6-2(k-3)).$$

On the other hand, there is at least one choice for $e_2$ (this can occur if $e_1$ has exactly
one adjacent internal edge in $T$). For $e_3$, there are at least $(n - 3) - 7$ choices (this can
occur if $e_1$ and $e_2$ each have no adjacent pendant edges in $T$). The remaining edges are
chosen in the same way as in (i). Hence

$$B_{T,k} \ge \frac{1}{2(k-2)!}(n-3)(n-10)(n-10-5(1)) \cdots (n-10-5(k-3)) = \frac{1}{2(k-2)!} n^{k-1} + O(n^{k-2}).$$

(iii) Let $F$ be the subgraph of $T$ consisting of the edges $e_1, \ldots, e_k$. Then $F$ has $m \le k - 2$
components. Suppose we first choose $m$ internal edges of $T$ corresponding to one edge
in each component of $F$. By (i), the number of such choices is $O(n^m)$, as each of these
edges will contribute a linear factor to the total number of ways of choosing the $k$ edges.
However, the remaining $k - m \ge 2$ edges must be chosen from edges that are adjacent
to those already chosen. The number of these choices depends only on the number and
location of the edges already chosen, and not on $n$. Hence the number of possible sets
is $O(n^m)$, where $m \le k - 2$.

□

Note that in the proof of Lemma 3.5, it may not be possible to maximise (or minimise) the
number of choices for each individual edge in T; however, this is not a problem as we only require
bounds on the number of choices of the k edges of T.
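To see the bounds of Lemma 3.5(i) at work, one can count pairwise non-adjacent edge sets exhaustively in a caterpillar, whose n - 3 internal edges form a path so that two internal edges are adjacent exactly when they are consecutive. The sketch below is an illustration under that assumption (caterpillars only, hypothetical function names), comparing k! times the exact count with the two products used in the proof.

```python
from itertools import combinations
from math import factorial, prod


def nonadjacent_sets_caterpillar(n, k):
    """Count k-sets of pairwise non-adjacent internal edges in a caterpillar.

    The internal edges are indexed 0..n-4 along the spine; two of them are
    adjacent exactly when their indices differ by one.
    """
    m = n - 3
    return sum(
        1
        for chosen in combinations(range(m), k)
        if all(b - a > 1 for a, b in zip(chosen, chosen[1:]))
    )


def product_bounds(n, k):
    """The products from the proof, before division by k!."""
    lower = prod(n - 3 - 5 * i for i in range(k))
    upper = prod(n - 3 - 2 * i for i in range(k))
    return lower, upper


n, k = 20, 3
lower, upper = product_bounds(n, k)
print(lower, factorial(k) * nonadjacent_sets_caterpillar(n, k), upper)
```

The caterpillar sits strictly between the two extremes for k of at least two, which is consistent with the bounds describing best and worst cases that need not be attained by any single tree.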

From Lemma 3.2, Lemma 3.3, and Lemma 3.4, we know the number of binary phylogenetic
trees whose splits differ from those of $T \in UB(n)$ by exactly $k$ splits over a given set of $k$ edges.
From Lemma 3.5, we have the number of subsets of $k$ internal edges. We are now in a position to
prove Theorem 3.1.


Proof of Theorem 3.1 We break down the calculation of the size of the $k$th RF neighbourhood
of $T$ into two steps. We consider the number of trees whose splits differ from those of
$T$ by exactly the $k$ splits corresponding to a given set of $k$ distinct internal edges of $T$. We then
consider the number of ways these $k$ edges can be chosen in $T$. By Lemma 3.2, given $T$ and a
set of $k$ distinct internal edges of $T$ with associated split set $\Sigma_k$, the number of trees with the
splits $\Sigma(T) - \Sigma_k$ and none of the splits in $\Sigma_k$ (that is, $\mathring{\Lambda}(T, \Sigma_k)$) is independent of $n$. Hence, only the
number of ways of choosing the $k$ edges in $T$ is dependent on $n$.

By Lemma 3.5, when we count the number of ways of choosing $k$ distinct internal edges of $T$,
the case where the $k$ edges are pairwise non-adjacent (Case 1 from the beginning of this section)
gives a term of order $n^k$ and a term of order $n^{k-1}$. The case where exactly two of the $k$ edges
are adjacent (Case 2) produces a term of order $n^{k-1}$, but does not have a term of order $n^k$. If
more than two of the $k$ edges are adjacent (Case 3), then the highest order term is $O(n^{k-2})$.

Now we consider the number of trees whose splits differ from those of $T$ by exactly the $k$
splits corresponding to a given set of $k$ distinct internal edges of $T$. From the information above,
the only two cases we need to consider are those where the $k$ edges are pairwise non-adjacent, or
exactly two of the $k$ edges are adjacent. By Lemma 3.3, the case where all edges are pairwise
non-adjacent produces $2^k$ $k$th RF neighbours with splits that differ from the splits of $T$ over
precisely the $k$ given internal edges. In the case where exactly two edges are adjacent, the $k - 2$
remaining pairwise non-adjacent edges give a factor of $2^{k-2}$, by Lemma 3.3. By Lemma 3.4,
the adjacent pair of edges results in 10 neighbours. Hence, in total, there are $10 \cdot 2^{k-2}$ $k$th RF
neighbours. Therefore, by Lemma 3.5,

$$|N^k_{RF}(T)| \ge \left(\frac{n^k}{k!} - \frac{k(5k+1)}{2\,k!} n^{k-1}\right) 2^k + 10\left(\frac{1}{2(k-2)!} n^{k-1}\right) 2^{k-2} + O(n^{k-2})
= \frac{2^k}{k!} n^k - \frac{(5k^2+7k)\,2^k}{4\,k!} n^{k-1} + O(n^{k-2}),$$

and

$$|N^k_{RF}(T)| \le \left(\frac{n^k}{k!} - \frac{k(k+2)}{k!} n^{k-1}\right) 2^k + 10\left(\frac{2}{(k-2)!} n^{k-1}\right) 2^{k-2} + O(n^{k-2})
= \frac{2^k}{k!} n^k + \frac{(4k^2-7k)\,2^k}{k!} n^{k-1} + O(n^{k-2}).$$

□

3.1 Shared Splits In this subsection, we present a simple and general upper bound on
the proportion of binary trees that share at least $k$ non-trivial splits with a given tree on the
same leaf set. The relevance of this result for biology is that it shows that a 'random' binary
tree (selected with uniform probability) has a low probability of sharing more than a few splits
with a given tree, regardless of the number of leaves (species) involved and the topology of the
given tree. For example, the probability of sharing three non-trivial splits is at most 0.02.

Let $T_0$ be a phylogenetic tree with $n$ leaves, and let $\pi_k(T_0)$ be the proportion of trees in
$UB(n)$ that share at least $k$ non-trivial splits with $T_0$ (note that $T_0$ does not have to be a binary
tree). Thus, $\pi_k(T_0)$ is the proportion of binary phylogenetic trees $T$ for which

$$d_{RF}(T, T_0) \le \tfrac{1}{2}(i_0 + n - 3 - 2k),$$

where $i_0$ is the number of internal edges of $T_0$. In general, $\pi_k(T_0)$ will depend on properties of
the tree $T_0$; however, there is a universal upper bound on $\pi_k(T_0)$ that applies for any choice of $T_0$
and is independent of the number of internal edges in $T_0$, and even of $n$. This is provided by the
following result.

Theorem 3.6. For any phylogenetic tree $T_0$ with $n$ leaves, the proportion, $\pi_k(T_0)$, of trees
in $UB(n)$ that share at least $k$ non-trivial splits with $T_0$ satisfies

$$\pi_k(T_0) \le \frac{1}{2^k\,k!}$$

for all $k = 1, 2, \ldots, i_0$ and $\pi_k(T_0) = 0$ for all $k > i_0$, where $i_0$ is the number of internal edges
of $T_0$.

Proof. Let $N_k(T_0)$ be the number of trees in $UB(n)$ that share at least $k$ non-trivial splits
with $T_0$. Let $\Sigma_0 = \Sigma(T_0)$, the set of non-trivial splits of $T_0$. We have

$$N_k(T_0) = \Big|\bigcup_{\Sigma \subseteq \Sigma_0 :\, |\Sigma| = k} \{T \in UB(n) : \Sigma \subseteq \Sigma(T)\}\Big|.$$

Therefore, by the union bound, we have

$$N_k(T_0) \le \sum_{\Sigma \subseteq \Sigma_0 :\, |\Sigma| = k} |\{T \in UB(n) : \Sigma \subseteq \Sigma(T)\}|.$$

Since there are precisely $\binom{i_0}{k}$ terms in this sum, we obtain

$$N_k(T_0) \le \binom{i_0}{k} \cdot M, \qquad (3.3)$$

where

$$M = \max_{\Sigma \subseteq \Sigma_0 :\, |\Sigma| = k} |\{T \in UB(n) : \Sigma \subseteq \Sigma(T)\}|.$$

Now, for a subset $\Sigma$ of $\Sigma_0$ let $T_\Sigma$ be the unique non-binary phylogenetic tree that has $\Sigma$ as its set
of non-trivial splits (i.e. the tree obtained from $T_0$ by contracting each internal edge of $T_0$ that
is not associated with a split in $\Sigma$). Let $\mathring{V}(T_\Sigma)$ denote the set of interior vertices of $T_\Sigma$. Then

$$|\{T \in UB(n) : \Sigma \subseteq \Sigma(T)\}| = \prod_{v \in \mathring{V}(T_\Sigma)} |UB(\deg(v))|. \qquad (3.4)$$

This is similar to the result in Lemma 3.2(i), but $T_\Sigma$ is not binary.

Now, for each vertex $v \in \mathring{V}(T_\Sigma)$ we have $\deg(v) \ge 3$. Moreover, when $|\Sigma| = k$, a simple
counting argument shows that

$$|\mathring{V}(T_\Sigma)| = k + 1 \quad \text{and} \quad \sum_{v \in \mathring{V}(T_\Sigma)} \deg(v) = n + 2k. \qquad (3.5)$$

Leaving trees for a moment, write $b(m) = |UB(m)|$ and consider the optimisation problem of maximising
$\prod_{i=1}^{N} b(n_i)$, subject to the constraints that $n_1, n_2, \ldots, n_N$ are integers, each taking a value of at least 3, and with
a sum equal to $R \ge 3N$. It follows from the faster-than-exponential growth of the function $b$
that the maximum possible value is $b(R - 3(N - 1))$ (Lemma 5 of [25]). Taking $N = k + 1$ and
$R = n + 2k$ (from Equation (3.5)), so that $R - 3(N - 1) = n - k$, we see from Equation (3.4)
that $|\{T \in UB(n) : \Sigma \subseteq \Sigma(T)\}| \le b(n - k)$ for any $\Sigma \subseteq \Sigma_0$ with $|\Sigma| = k$. In other words, from
Equation (3.3),

$$N_k(T_0) \le \binom{i_0}{k}\, b(n - k).$$

Consequently,

$$\pi_k(T_0) = N_k(T_0)/b(n) \le \binom{i_0}{k}\, b(n - k)/b(n) \le \binom{n-3}{k}\, b(n - k)/b(n),$$

where the last inequality holds because $i_0 \le n - 3$.
Finally, notice that we can write

$$\binom{n-3}{k}\, b(n - k)/b(n) = \frac{1}{k!} \prod_{j=0}^{k-1} \frac{n - j - 3}{2n - 2j - 5},$$

and each of the $k$ terms in the product is strictly less than $\tfrac{1}{2}$. This completes the proof.

□
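The final step of the proof above can be evaluated exactly for small parameters. The sketch below assumes $b(n) = |UB(n)| = 1 \cdot 3 \cdots (2n-5)$ and compares the intermediate quantity $\binom{n-3}{k}\,b(n-k)/b(n)$ with the universal bound $1/(2^k k!)$; the function names are illustrative only.

```python
from fractions import Fraction
from math import comb, factorial


def b(n):
    """Number of unrooted binary trees on n >= 3 leaves: 1 * 3 * ... * (2n - 5)."""
    result = 1
    for odd in range(1, 2 * n - 4, 2):
        result *= odd
    return result


def intermediate_bound(n, k):
    """C(n-3, k) * b(n-k) / b(n), valid for 1 <= k <= n - 3."""
    return Fraction(comb(n - 3, k) * b(n - k), b(n))


def universal_bound(k):
    """The bound 1 / (2^k k!), which does not depend on n."""
    return Fraction(1, 2 ** k * factorial(k))


for k in (1, 2, 3):
    print(k, float(intermediate_bound(50, k)), float(universal_bound(k)))
```

For k = 3 the universal bound equals 1/48, which is the source of the "at most 0.02" figure quoted at the start of this subsection.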
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4 Nearest Neighbour Interchange In this section we provide a new asymptotic expres¬ 
sion for the size of the Nearest Neighbour Interchange (NNI) neighbourhood of an unrooted 
binary tree. 

Let T € UB{n) and let e = {x,y} be an interior edge of T. Let Ai and A 3 be subtrees of 
T that are distance one from e and distance three apart (see Fig. 4.1). Then Ai and A 3 are 
swappable across e. Let vertex zi adjacent to x be the root of Ai, and Z 3 adjacent to y be the 
root of A 3 . An NNI operation on T is performed by deleting the edges {x,zi} and {y,Z 3 }, and 
inserting edges {x, Z 3 } and {y, zi}. We will also refer to this process as swapping the subtrees Ai 
and A 3 across e. The resulting tree is a first NNI neighbour of T. To make it clear which edge 
of a tree T two subtrees are swapped across in an NNI operation on T, we will refer to such an 
operation as an NNI operation on edge e in T. 

The two distinct first NNI neighbours resulting from an NNI operation on edge $e$ in $T$ can
be seen in Fig. 4.1. We have four subtrees $A_1$, $A_2$, $A_3$ and $A_4$ that are all distance one from $e$
in $T$. To obtain $T'$ from $T$ we swap subtrees $A_2$ and $A_3$, and to obtain $T''$ we swap subtrees
$A_2$ and $A_4$. Note that swapping subtrees $A_1$ and $A_4$ in $T$ produces a tree isomorphic to $T'$.
Although there are four different pairs of subtrees that could be swapped across $e$, there are only
two distinct first neighbours that can be obtained from NNI operations on $e$.
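The swap across an edge can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: trees are stored as adjacency sets, leaves are the integers 1..n, internal vertices are strings, and the helper names (`caterpillar`, `leaf_side`, `splits`, `swap`) are hypothetical. Two trees are compared through their sets of non-trivial splits. The sketch performs all four possible swaps across one internal edge and counts the distinct results.

```python
def caterpillar(n):
    """Adjacency sets of a caterpillar on leaves 1..n (internal vertices are strings)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    spine = [f"v{i}" for i in range(n - 2)]
    for a, b in zip(spine, spine[1:]):
        link(a, b)
    link(spine[0], 1)
    link(spine[-1], n)
    for i, s in enumerate(spine):
        link(s, i + 2)
    return adj


def leaf_side(adj, u, v):
    """Leaves on the v-side of the edge {u, v}."""
    stack, seen, side = [v], {u, v}, set()
    while stack:
        x = stack.pop()
        if len(adj[x]) == 1:
            side.add(x)
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return side


def splits(adj):
    """Set of non-trivial splits; each split is stored as the side without the smallest leaf."""
    leaves = {v for v in adj if len(adj[v]) == 1}
    ref = min(leaves)
    out = set()
    for u in adj:
        for v in adj[u]:
            if len(adj[u]) == 3 and len(adj[v]) == 3:  # internal edge
                side = leaf_side(adj, u, v)
                out.add(frozenset(leaves - side if ref in side else side))
    return frozenset(out)


def swap(adj, x, y, a, b):
    """Swap the subtree rooted at a (adjacent to x) with the one rooted at b (adjacent to y)."""
    new = {k: set(vs) for k, vs in adj.items()}
    new[x].remove(a); new[a].remove(x)
    new[y].remove(b); new[b].remove(y)
    new[x].add(b); new[b].add(x)
    new[y].add(a); new[a].add(y)
    return new


T = caterpillar(6)
x, y = "v1", "v2"                      # an internal edge of T
A = [v for v in T[x] if v != y]        # roots of the two subtrees at x
B = [v for v in T[y] if v != x]        # roots of the two subtrees at y
results = {splits(swap(T, x, y, a, b)) for a in A for b in B}
print(len(results))                    # number of distinct trees from the four swaps
```

Comparing trees by their split sets relies on the fact, used later in this section, that two trees are equal if and only if they have the same set of splits.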


Fig. 4.1. The two first NNI neighbours of T resulting from an NNI operation on the edge e.

We see in Fig. 4.1 that in $T'$ and $T''$, all four subtrees $A_1$, $A_2$, $A_3$ and $A_4$ are distance
one from $e$. Given a labelling of the edges of the original tree $T$, we preserve this labelling by
assigning the label $a_i$ to the edge incident to subtree $A_i$ in $T$ and in the two first NNI neighbours
of $T$ resulting from an NNI operation on edge $e$. Note that $T$ can also be obtained from $T'$ by
an NNI operation, which we call the inverse of the operation used to obtain $T'$ from $T$.

Consider a graph $G$ in which each vertex represents a tree in $UB(n)$ and there is an edge
between the vertices representing trees $T_1$ and $T_2$ if they are first NNI neighbours. The NNI
distance between $T_1$ and $T_2$, $\delta_{NNI}(T_1, T_2)$, is the distance between the two vertices representing
trees $T_1$ and $T_2$ in $G$.

Throughout this section we will consider the trees resulting from a series of NNI operations
beginning with a tree $T \in UB(n)$. Let $NNI(T; e_1, e_2, \ldots, e_k) \subseteq UB(n)$ be the set of
trees that can be obtained by performing an NNI operation on internal edge $e_1$ in $T$ to give $T_1$,
followed by an NNI operation on internal edge $e_2$ in $T_1$ to give $T_2$, and so on until we have
completed $k$ NNI operations. Note that if $T' \in NNI(T; e_1, \ldots, e_k)$, $T'$ is not necessarily a
$k^{\text{th}}$ NNI neighbour of $T$. It may instead be a $j^{\text{th}}$ NNI neighbour of $T$ for some $j < k$ ($j \in \mathbb{N}$).

In this section we consider the sizes of the first, second, third and $k^{\text{th}}$ NNI neighbourhoods
of an unrooted binary phylogenetic tree. Let $T \in UB(n, c)$ (recall that $UB(n, c)$ is the set of
unrooted, binary phylogenetic trees with $n$ leaves and $c$ cherries). Then

1. $|N_{NNI}(T)| = 2(n-3)$, and

2. $|N^2_{NNI}(T)| = 2n^2 - 10n + 4c$, and

3. \Nfjj^j{T)\ = |n3 - 8n2 - f n + 8cn + Ups^T) + 164, 

where $p_3(T)$ is the number of internal paths of length three in $T$. These results for the first and
second NNI neighbourhoods were shown by Robinson [22]. It is interesting to compare these
with the corresponding results for the RF distance. In both cases, the first neighbourhood is
dependent only on the number of leaves, while the second neighbourhood is determined by the
number of leaves and cherries. In fact, the size of the first NNI neighbourhood is the same as the
size of the first RF neighbourhood. Robinson [22] also found an upper bound on the size of the
third NNI neighbourhood of a binary phylogenetic tree. We use this result to derive the exact
formula for the size of the third NNI neighbourhood given above (see the proof in Appendix A,
along with a brief discussion of how to determine $p_3(T)$).
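The first two formulas can be checked on small trees by breadth-first search over NNI moves. The sketch below is an illustration only: the adjacency-set representation and all function names (`caterpillar`, `splits`, `nni_neighbours`, `shell_sizes`) are assumptions made here, and only caterpillars (which have $c = 2$ cherries) are tested.

```python
def caterpillar(n):
    """Adjacency sets of a caterpillar on leaves 1..n (internal vertices are strings)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    spine = [f"v{i}" for i in range(n - 2)]
    for a, b in zip(spine, spine[1:]):
        link(a, b)
    link(spine[0], 1)
    link(spine[-1], n)
    for i, s in enumerate(spine):
        link(s, i + 2)
    return adj


def leaf_side(adj, u, v):
    """Leaves on the v-side of the edge {u, v}."""
    stack, seen, side = [v], {u, v}, set()
    while stack:
        x = stack.pop()
        if len(adj[x]) == 1:
            side.add(x)
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return side


def splits(adj):
    """Canonical form of a tree: its set of non-trivial splits."""
    leaves = {v for v in adj if len(adj[v]) == 1}
    ref = min(leaves)
    out = set()
    for u in adj:
        for v in adj[u]:
            if len(adj[u]) == 3 and len(adj[v]) == 3:
                side = leaf_side(adj, u, v)
                out.add(frozenset(leaves - side if ref in side else side))
    return frozenset(out)


def nni_neighbours(adj):
    """All trees one NNI operation away (two per internal edge, possibly repeated)."""
    out = []
    for x in adj:
        for y in adj[x]:
            if len(adj[x]) == 3 and len(adj[y]) == 3 and x < y:
                a = next(v for v in adj[x] if v != y)
                for b in [v for v in adj[y] if v != x]:
                    new = {k: set(vs) for k, vs in adj.items()}
                    new[x].remove(a); new[a].remove(x)
                    new[y].remove(b); new[b].remove(y)
                    new[x].add(b); new[b].add(x)
                    new[y].add(a); new[a].add(y)
                    out.append(new)
    return out


def shell_sizes(adj, kmax):
    """Number of trees at NNI distance exactly 1, 2, ..., kmax from the given tree."""
    seen, frontier, sizes = {splits(adj)}, [adj], []
    for _ in range(kmax):
        nxt = []
        for t in frontier:
            for s in nni_neighbours(t):
                key = splits(s)
                if key not in seen:
                    seen.add(key)
                    nxt.append(s)
        sizes.append(len(nxt))
        frontier = nxt
    return sizes


c = 2  # every caterpillar has exactly two cherries
for n in (5, 6, 7):
    first, second = shell_sizes(caterpillar(n), 2)
    print(n, first == 2 * (n - 3), second == 2 * n * n - 10 * n + 4 * c)
```

The search is exhaustive and therefore only practical for small $n$; it is meant as a sanity check of the closed forms, not as a way of computing neighbourhood sizes in general.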

As mentioned previously, tree rearrangement operations are also used to compare trees produced
by different tree reconstruction methods, or trees obtained from different data sets. This
can be achieved by determining the NNI distance (the smallest number of operations) between
the two trees. DasGupta et al. [10] showed that the problem of computing the NNI distance
between two trees in $UB(n)$ is NP-complete. Culik and Wood [9] found an upper bound of
$4n\log(n)$ on the NNI distance between two trees in $UB(n)$, which was later improved to $n\log(n)$
by Li et al. [20].

It is also useful to understand the structure of $UB(n)$, and the first and second NNI neighbourhoods
of a tree (e.g. how the first NNI neighbours of a tree relate to each other). A walk
in a graph $G$ is a sequence of vertices and edges, in which the vertices are not necessarily distinct.
Consider a graph $G$ in which each vertex represents a tree in $UB(n)$ and there is an edge
between the vertices representing trees $T_1$ and $T_2$ if they are first NNI neighbours. Bryant [5]
noted that the length of the shortest walk that visits every vertex of $G$ was unknown. Gordon
et al. [14] provided a constructive proof that this walk is a Hamiltonian path (a path that visits
every vertex of $G$ exactly once). Therefore by a series of NNI operations beginning from a tree
$T \in UB(n)$, it is possible to visit each tree in $UB(n)$ exactly once. We refer to this series of NNI
operations as an NNI walk. In Section 5, we investigate the structure of $UB(n)$ by determining
the number of pairs of trees that share a first NNI neighbour (the number of pairs of trees that
are within NNI distance two of each other).

4.1 Asymptotic Result for the $k^{\text{th}}$ Neighbourhood. Our main result for this section
is the asymptotic expression for the size of the $k^{\text{th}}$ NNI neighbourhood of a binary tree given in
Theorem 4.1.


Theorem 4.1. Let $T \in UB(n)$ ($n \geq 4$). Then for each fixed $k \in \mathbb{Z}^+$,

\[ |N^k_{NNI}(T)| = \frac{2^k n^k}{k!}\left(1 + D_{T,k}\, n^{-1} + O(n^{-2})\right), \]

where


3k{k + 1) 


^ Dx^k 'Ll ikik — 2 ). 


(4.1) 


2 
We will prove Theorem 4.1 at the end of Section 4.1. As in the proof of Theorem 3.1, we
consider the number of $k^{\text{th}}$ NNI neighbours resulting from NNI operations on a given set of $k$
internal edges. From Lemma 3.5, we know the number of sets of $k$ internal edges of $T$. Combining
these gives us the total number of $k^{\text{th}}$ NNI neighbours. The four different cases that are relevant
are:

1. The k edges are distinct and pairwise non-adjacent. 

2. The k edges are distinct and exactly two are adjacent. 

3. The k edges are distinct and more than two are adjacent. 

4. The k edges are not all distinct. 

These are the same cases as for RF, with the additional possibility that the $k$ edges are not all
distinct (Case 4). In Equation 4.1 of Theorem 4.1, the term of order $n^k$ is completely determined
by Case 1, whereas the term of order $n^{k-1}$ is determined by Cases 1 and 2. We show that all
other possibilities for the $k$ edges (covered by Cases 3 and 4) only contribute to terms of order
$n^{k-2}$ or lower.

Neighbours Resulting from NNI Operations on k Given Edges 

We cannot determine the size of the $k^{\text{th}}$ NNI neighbourhood by simply counting the number
of possible sets of $k$ NNI operations, as there are situations where changing the order of the
operations, or the edges they are on, does not change the resulting tree. We need to determine
exactly when this occurs in order to accurately count the size of the neighbourhood. Therefore,
some preliminary work is required before we investigate the four cases outlined above.

First, we consider the impact that a single NNI operation on a tree $T$ has on the non-trivial
splits of $T$.

Lemma 4.2. Let $T \in UB(n)$ ($n \geq 4$) and $T' \in NNI(T; e)$ where $e$ is an internal edge of $T$.
Then

\[ |\Sigma(T) - \Sigma(T')| = |\Sigma(T') - \Sigma(T)| = 1. \]

Furthermore, $\Sigma(T) - \Sigma(T') = \{S(T, e)\}$, and for all internal edges $e' \neq e$ in $T$, we have
$S(T', e') = S(T, e')$.

Proof. Note that $|\Sigma(T)| = |\Sigma(T')|$ as $T, T' \in UB(n)$. Let the subtrees that are distance
one from $e$ in $T$ be called $A$, $B$, $C$ and $D$, such that $d_T(A, B) = 2$. Then we have
$S(T, e) = \{\mathcal{L}(A) \cup \mathcal{L}(B), \mathcal{L}(C) \cup \mathcal{L}(D)\}$. Either $A$ or $B$ is one of the subtrees that are swapped
by the NNI operation, so $d_{T'}(A, B) = 3$, and $\mathcal{L}(A)$ and $\mathcal{L}(B)$ are in different parts of $S(T', e)$.
Hence $S(T, e) \neq S(T', e)$.

Suppose there exists an internal edge $e'$ of $T$, such that $e' \neq e$. Let $S(T, e') = \{L_1, L_2\}$.
In $T$, either $e'$ is adjacent to $e$, or $e'$ is in one of the subtrees $A$, $B$, $C$ or $D$. Therefore, either
$L_1$ or $L_2$ is a subset of the leaves in one of the subtrees $A$, $B$, $C$ or $D$. Since $A$, $B$, $C$ and $D$ are
the four subtrees of $T'$ that are distance one from $e$, $S(T', e') = S(T, e')$.

Therefore $\Sigma(T) - \Sigma(T') = \{S(T, e)\}$ and $\Sigma(T') - \Sigma(T) = \{S(T', e)\}$. Hence

\[ |\Sigma(T) - \Sigma(T')| = |\Sigma(T') - \Sigma(T)| = 1. \]

□

We now compare the non-trivial splits of trees that are NNI neighbours. 

Lemma 4.3. Let $T \in UB(n)$ ($n \geq 4$), and let $e_1, \ldots, e_k$ ($k \geq 1$) be internal edges of
$T$ such that there exists $e_m$ ($1 \leq m \leq k$) for which $e_m \notin \{e_1, \ldots, e_{m-1}, e_{m+1}, \ldots, e_k\}$. Let
$T' \in NNI(T; e_1, \ldots, e_k)$. Then

(i) if $e'$ is an internal edge of $T$ and $e' \notin \{e_1, \ldots, e_k\}$, then $S(T', e') = S(T, e')$, and

(ii) $S(T, e_m)$ is not a split of $T'$.

Proof.

(i) To see that $S(T', e') = S(T, e')$, consider trees $T_1, \ldots, T_{k-1}$, where $T_1 \in NNI(T; e_1)$,
$T_2 \in NNI(T_1; e_2)$, ..., $T_{k-1} \in NNI(T_{k-2}; e_{k-1})$, and $T' \in NNI(T_{k-1}; e_k)$. Then by
Lemma 4.2,

\[ S(T', e') = S(T_{k-1}, e') = \cdots = S(T_1, e') = S(T, e'). \]

(ii) Let $T_{m-1} \in NNI(T; e_1, \ldots, e_{m-1})$ such that $T' \in NNI(T_{m-1}; e_m, \ldots, e_k)$. By (i), since
$e_m \notin \{e_1, \ldots, e_{m-1}\}$, $S(T_{m-1}, e_m) = S(T, e_m)$. Let $A$ and $B$ be two subtrees in $T_{m-1}$
that are distance one from $e_m$, where $d(A, B) = 2$. Then

\[ S(T_{m-1}, e_m) = S(T, e_m) = \{\mathcal{L}(A) \cup \mathcal{L}(B),\; \mathcal{L}(T) - (\mathcal{L}(A) \cup \mathcal{L}(B))\}. \]

The NNI operation on edge $e_m$ in $T_{m-1}$ swaps either $A$ or $B$ with one of the other
two subtrees that are distance one from $e_m$. Let $T_m \in NNI(T_{m-1}; e_m)$ such that
$T' \in NNI(T_m; e_{m+1}, \ldots, e_k)$. Then $\mathcal{L}(A)$ and $\mathcal{L}(B)$ are in different parts of $S(T_m, e_m)$.
Since $e_m \notin \{e_{m+1}, \ldots, e_k\}$, we have $S(T', e_m) = S(T_m, e_m)$ by (i). Hence all leaves in $A$
are in a different component of $T' \setminus e_m$ to the leaves of $B$. Therefore, $S(T, e_m)$ is not a
split of $T'$.

□

Corollary 4.4. Let $T \in UB(n)$ ($n \geq 4$) and let $e_1, \ldots, e_k$ ($k > 1$) be internal edges
of $T$ such that there exists $e_m$ ($1 \leq m \leq k$) for which $e_m \notin \{e_1, \ldots, e_{m-1}, e_{m+1}, \ldots, e_k\}$. Let
$P = NNI(T; e_1, \ldots, e_j)$ and $Q = NNI(T; e_{j+1}, \ldots, e_k)$ where $1 \leq j < k$. Then

\[ P \cap Q = \emptyset. \]

Proof. Without loss of generality, suppose that $1 \leq m \leq j$. By Lemma 4.3, $S(T, e_m)$ is not
a split of any of the trees in $P$.

Additionally by Lemma 4.3, for all $T' \in Q$, $S(T', e_m) = S(T, e_m)$, since $e_m \notin \{e_{j+1}, \ldots, e_k\}$.
Therefore, since two trees are equal if and only if they have the same set of splits, we have
$P \cap Q = \emptyset$.

□

We now consider the number of $k^{\text{th}}$ NNI neighbours of a tree resulting from a series of $k$
NNI operations on a given set of distinct edges.

Lemma 4.5. Let $T \in UB(n)$ ($n \geq 4$), and let $e_1, e_2, \ldots, e_k$ ($1 \leq k \leq n-3$) be distinct
internal edges of $T$. Then $NNI(T; e_1, \ldots, e_k)$ is a subset of $N^k_{NNI}(T)$ of size $2^k$.

Proof. For each edge $e$ of $T$ there are two distinct first NNI neighbours resulting from NNI
operations on $e$. Since we perform NNI operations on $k$ different edges in $T$, there are $2^k$
$k^{\text{th}}$ NNI neighbours, provided that none of the resulting trees are equivalent, or in the $j^{\text{th}}$ NNI
neighbourhood of $T$ for some $j < k$.

The latter follows from Corollary 4.4, since $e_1, \ldots, e_k$ are distinct. This means that

\[ NNI(T; e_1, \ldots, e_k) \subseteq N^k_{NNI}(T). \]

To show that none of the resulting $2^k$ trees are equivalent, we consider the splits of these trees.
Let $T_k$ and $T_k'$ be two trees in $NNI(T; e_1, \ldots, e_k)$, where at least one operation produced a different
first neighbour in each case. In other words, there exist trees $T_{m-1}, T_m, T_m' \in UB(n)$ such
that $T_{m-1} \in NNI(T; e_1, \ldots, e_{m-1})$, $\{T_m, T_m'\} \subseteq NNI(T_{m-1}; e_m)$, $T_k \in NNI(T_m; e_{m+1}, \ldots, e_k)$,
$T_k' \in NNI(T_m'; e_{m+1}, \ldots, e_k)$, and $T_m \neq T_m'$. Note that since $T_m$ and $T_m'$ are the two distinct
first NNI neighbours of $T_{m-1}$ obtained by an NNI operation on $e_m$, $T_m' \in NNI(T_m; e_m)$.

Now we consider the splits of $T_m$, $T_m'$, $T_k$ and $T_k'$. By Lemma 4.3, $S(T_m, e_m) \neq S(T_m', e_m)$,
as $T_m$ and $T_m'$ are first NNI neighbours (by an operation on edge $e_m$). Since $e_m \notin \{e_{m+1}, \ldots, e_k\}$,
$S(T_k, e_m) = S(T_m, e_m) \neq S(T_m', e_m) = S(T_k', e_m)$ by Lemma 4.3. Hence $T_k \neq T_k'$. Therefore, we
obtain $2^k$ distinct $k^{\text{th}}$ NNI neighbours from $k$ NNI operations on distinct edges $e_1, \ldots, e_k$ in order.
□

An important factor to consider is the effect that the distance between two edges in consecutive
NNI operations on a tree $T$ has on the resulting second NNI neighbours. The following
result is due to Robinson [22] and was originally proved by exhaustion, leaving the details to the
reader. We provide an alternate proof in Appendix B.

Lemma 4.6. Let $T \in UB(n)$ ($n \geq 5$), and let $e_1$ and $e_2$ be distinct internal edges of $T$.
Let $P = NNI(T; e_1, e_2)$ and $Q = NNI(T; e_2, e_1)$. If $e_1$ and $e_2$ are adjacent then $P \cap Q = \emptyset$;
otherwise, $P = Q$.
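Lemma 4.6 can be checked on a small example. The following Python sketch is only an illustration under stated assumptions: the adjacency-set representation, the explicit edge-label dictionary `lab` (which carries edge labels through each swap, in the spirit of the labelling convention described earlier in this section), and all function names are hypothetical choices made here. It compares $NNI(T; e_1, e_2)$ with $NNI(T; e_2, e_1)$ for an adjacent and a non-adjacent pair of internal edges of a six-leaf caterpillar.

```python
def caterpillar(n):
    """Adjacency sets of a caterpillar on leaves 1..n (internal vertices are strings)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    spine = [f"v{i}" for i in range(n - 2)]
    for a, b in zip(spine, spine[1:]):
        link(a, b)
    link(spine[0], 1)
    link(spine[-1], n)
    for i, s in enumerate(spine):
        link(s, i + 2)
    return adj


def leaf_side(adj, u, v):
    """Leaves on the v-side of the edge {u, v}."""
    stack, seen, side = [v], {u, v}, set()
    while stack:
        x = stack.pop()
        if len(adj[x]) == 1:
            side.add(x)
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return side


def splits(adj):
    """Canonical form of a tree: its set of non-trivial splits."""
    leaves = {v for v in adj if len(adj[v]) == 1}
    ref = min(leaves)
    out = set()
    for u in adj:
        for v in adj[u]:
            if len(adj[u]) == 3 and len(adj[v]) == 3:
                side = leaf_side(adj, u, v)
                out.add(frozenset(leaves - side if ref in side else side))
    return frozenset(out)


def nni(adj, lab, e):
    """Both results of an NNI on the edge labelled e; edge labels follow their subtrees."""
    x, y = tuple(lab[e])
    a = next(v for v in adj[x] if v != y)
    out = []
    for b in [v for v in adj[y] if v != x]:
        new = {k: set(vs) for k, vs in adj.items()}
        new[x].remove(a); new[a].remove(x)
        new[y].remove(b); new[b].remove(y)
        new[x].add(b); new[b].add(x)
        new[y].add(a); new[a].add(y)
        # the edges that led to a and b now hang off the other endpoint of e
        moved = {frozenset((x, a)): frozenset((y, a)),
                 frozenset((y, b)): frozenset((x, b))}
        out.append((new, {l: moved.get(p, p) for l, p in lab.items()}))
    return out


def nni_seq(adj, lab, seq):
    """Set of trees (as split sets) obtainable by NNI operations on the labelled edges in order."""
    trees = [(adj, lab)]
    for e in seq:
        trees = [t for a, l in trees for t in nni(a, l, e)]
    return {splits(a) for a, _ in trees}


T = caterpillar(6)
lab = {"e1": frozenset(("v0", "v1")),
       "e2": frozenset(("v1", "v2")),
       "e3": frozenset(("v2", "v3"))}
P, Q = nni_seq(T, lab, ["e1", "e2"]), nni_seq(T, lab, ["e2", "e1"])
print(P.isdisjoint(Q))                                                # e1, e2 adjacent
print(nni_seq(T, lab, ["e1", "e3"]) == nni_seq(T, lab, ["e3", "e1"]))  # e1, e3 non-adjacent
```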

This leads naturally to the following corollary. 

Corollary 4.7. Let $T \in UB(n)$ ($n \geq 4$), and let $e_1, \ldots, e_k$ ($k \geq 1$) be internal edges of $T$.
Let

\[ P = NNI(T; e_1, \ldots, e_m, e_{m+1}, \ldots, e_k), \]

\[ Q = NNI(T; e_1, \ldots, e_{m+1}, e_m, \ldots, e_k). \]

If $e_m$ and $e_{m+1}$ are distinct and non-adjacent, then $P = Q$.

Proof. Let $T_{m-1}$ be a tree in $NNI(T; e_1, \ldots, e_{m-1})$. By Lemma 4.6,

\[ NNI(T_{m-1}; e_m, e_{m+1}) = NNI(T_{m-1}; e_{m+1}, e_m). \]

This is true for any choice of $T_{m-1}$, so

\[ NNI(T; e_1, \ldots, e_{m-1}, e_m, e_{m+1}) = NNI(T; e_1, \ldots, e_{m-1}, e_{m+1}, e_m). \]

Therefore $P = Q$.

□
Lemma 4.6 and Corollary 4.7 tell us how the distance between the two edges in successive
NNI operations affects the resulting second NNI neighbours. Corollary 4.4 justifies that different
choices of edges for the two NNI operations do not produce any duplicate second NNI neighbours.
Lemma 4.5 tells us the number of second NNI neighbours resulting from NNI operations on a
given set of internal edges of a tree in order. We now have enough information to present some
results on the number of neighbours resulting from NNI operations on a given (unordered) set
of internal edges.

Lemma 4.8. Let $T \in UB(n)$ ($n \geq 4$).

(i) For any given set of $k$ distinct, pairwise non-adjacent internal edges ($1 \leq k \leq n-3$),
there are $2^k$ $k^{\text{th}}$ neighbours of $T$ resulting from NNI operations on this set of edges
in any order.

(ii) For any given set of $k$ distinct internal edges ($2 \leq k \leq n-3$) where exactly one pair
is adjacent, there are $2^{k+1}$ $k^{\text{th}}$ neighbours of $T$ resulting from NNI operations on this
set of edges in any order.

(iii) For a given $T$ and a given sequence of $k$ (not necessarily distinct) edges of $T$ ($k \geq 1$),
the number of $k^{\text{th}}$ NNI neighbours resulting from NNI operations on this sequence of edges
in any order is constant with respect to $n$.

Proof. 

(i) Suppose we perform NNI operations on $k$ distinct, pairwise non-adjacent internal edges
$e_1, \ldots, e_k$ of $T$. Lemma 4.5 tells us that if the NNI operations are performed in a given
order we obtain $2^k$ $k^{\text{th}}$ neighbours. Since the edges are pairwise non-adjacent, by Corollary 4.7,
changing the order of the operations does not change the set of trees produced.
Hence, there are $2^k$ $k^{\text{th}}$ neighbours of $T$ resulting from NNI operations on this set of edges
in any order.

(ii) The only difference between this and (i) is the pair of adjacent edges $e_i$ and $e_j$ (where
$1 \leq i < j \leq k$). By repeated applications of Corollary 4.7,

\[ NNI(T; e_1, \ldots, e_i, \ldots, e_j, \ldots, e_k) = NNI(T; e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_{j-1}, e_{j+1}, \ldots, e_k, e_i, e_j). \]

As in (i), by Lemma 4.5 and Corollary 4.7, performing NNI operations on the edges
$e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_{j-1}, e_{j+1}, \ldots, e_k$ in any given order produces the set of trees

\[ NNI(T; e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_{j-1}, e_{j+1}, \ldots, e_k), \]

where

\[ |NNI(T; e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_{j-1}, e_{j+1}, \ldots, e_k)| = 2^{k-2}. \]

Let $T_{k-2} \in NNI(T; e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_{j-1}, e_{j+1}, \ldots, e_k)$. By Lemma 4.6,

\[ NNI(T_{k-2}; e_i, e_j) \cap NNI(T_{k-2}; e_j, e_i) = \emptyset. \]

Therefore, since

\[ |NNI(T_{k-2}; e_i, e_j) \cup NNI(T_{k-2}; e_j, e_i)| = 4 + 4 = 8, \]

we have $8(2^{k-2}) = 2^{k+1}$ $k^{\text{th}}$ NNI neighbours of $T$.
(iii) Let $F$ be the subgraph of $T$ consisting of the edges $e_1, \ldots, e_k$. Then $F$ has $m$ components,
$C_1, \ldots, C_m$ ($1 \leq m \leq k$). Edges in different components of $F$ are not adjacent, so by
Corollary 4.7, the order in which we perform NNI operations on them does not change
the resulting neighbours. However, by Lemma 4.6, the order of NNI operations on the
edges that form a component of $F$ does change the resulting neighbours. Therefore the
number of neighbours resulting from NNI operations on the $k$ edges is

\[ \prod_{\ell=1}^{m} f(C_\ell), \]

where $f(C_\ell)$ is the number of distinct $k^{\text{th}}$ NNI neighbours resulting from NNI operations
in $T$ on the edges from $e_1, \ldots, e_k$ that are in component $C_\ell$ of $F$ (more than one NNI
operation may be on the same edge). We consider each component separately.

Let $C_p$, $1 \leq p \leq m$, be a component of $F$ with $q$ edges and consider calculating $f(C_p)$.
Let $f_1, \ldots, f_j$ ($j \leq k$) be the subsequence of the edges $e_1, \ldots, e_k$ that are in $C_p$. Note that
the edges $f_1, \ldots, f_j$ are not necessarily distinct. Add pendant edges incident to vertices
in $V(C_p)$, so that all of the vertices in $V(C_p)$ have degree three. The resulting tree $C_p'$
is an unrooted binary tree with $q + 3$ leaves. The internal edges of $C_p'$ are the distinct
edges of the sequence $f_1, \ldots, f_j$. Then $f(C_p)$ is equivalent to the number of distinct
$j^{\text{th}}$ NNI neighbours of $C_p'$ resulting from NNI operations on the edges $f_1, \ldots, f_j$ of $C_p'$. The
number of neighbours $f(C_p)$ from these operations depends only on the shape and
size of $C_p'$, the number of times we perform an NNI operation on each internal edge of $C_p'$,
and the order in which the operations are performed. All of these factors are determined
by the choice of the edges $e_1, \ldots, e_k$ of $T$. Therefore, given a tree $T$ and internal edges
$e_1, \ldots, e_k$ of $T$, the number of $k^{\text{th}}$ NNI neighbours of $T$ resulting from NNI operations on
the edges $e_1, \ldots, e_k$ in any order is independent of $n$.

□
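As a consistency check, parts (i) and (ii) of Lemma 4.8 can be combined by hand for $k = 2$ to recover the second-neighbourhood formula $2n^2 - 10n + 4c$ quoted at the start of this section. The sketch relies on one count that is derived here and is not taken from the text: a tree in $UB(n, c)$ has $n + c - 6$ pairs of adjacent internal edges (there are $n - 2c$ internal vertices adjacent to exactly one leaf, each contributing one such pair, and $c - 2$ internal vertices adjacent to no leaf, each contributing three). It also assumes, as the surrounding text states, that different edge sets give no duplicate neighbours and that repeating an edge gives no second neighbours.

```latex
\begin{align*}
|N^2_{NNI}(T)|
  &= 2^2 \left[ \binom{n-3}{2} - (n + c - 6) \right] + 2^3\,(n + c - 6) \\
  &= 2(n-3)(n-4) + 4(n + c - 6) \\
  &= 2n^2 - 10n + 4c .
\end{align*}
```

The first bracket counts the non-adjacent pairs of internal edges (each giving $2^2$ neighbours) and the second term counts the adjacent pairs (each giving $2^3$ neighbours).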

Before we consider the number of sets of $k$ edges we need two more results, which tell us
that the trees resulting from $k$ NNI operations on fewer than $k$ pairwise non-adjacent internal
edges of a tree are not $k^{\text{th}}$ neighbours (the case where the $k$ edges are not all distinct). This is
important, as if some of these trees are $k^{\text{th}}$ neighbours, then this case contributes to the $O(n^{-1})$
term in Theorem 4.1. First, in Lemma 4.9, we consider two consecutive NNI operations on the
same edge. Then, in Corollary 4.10, we consider two non-consecutive NNI operations on the same
edge.

Lemma 4.9. Let $T \in UB(n)$ ($n \geq 4$), and let $e_1, \ldots, e_k$ ($k \geq 2$) be internal edges of $T$.
Suppose there exists an $m$ ($1 \leq m \leq k-1$) for which $e_m = e_{m+1}$. Let

\[ P = NNI(T; e_1, \ldots, e_m, e_{m+1}, e_{m+2}, \ldots, e_k). \]

(i) If the NNI operation on edge $e_{m+1}$ is the inverse of the operation on edge $e_m$, then

\[ P = NNI(T; e_1, \ldots, e_{m-1}, e_{m+2}, \ldots, e_k). \]

(ii) If the NNI operation on edge $e_{m+1}$ is not the inverse of the operation on edge $e_m$, then

\[ P = NNI(T; e_1, \ldots, e_m, e_{m+2}, \ldots, e_k). \]

Furthermore, $P \cap N^k_{NNI}(T) = \emptyset$.
Proof. Let $T_{m-1} \in NNI(T; e_1, \ldots, e_{m-1})$ and let $T_m, T_m' \in NNI(T_{m-1}; e_m)$, $T_m \neq T_m'$.
Now suppose we perform an operation on edge $e_{m+1} = e_m$ in $T_m$ to obtain a tree $T_{m+1}$. Let
$T' \in NNI(T_{m+1}; e_{m+2}, \ldots, e_k)$.

(i) First, suppose the operation on edge $e_{m+1}$ is the inverse of the operation on edge $e_m$.
Then $T_{m+1} = T_{m-1}$. Hence $P = NNI(T; e_1, \ldots, e_{m-1}, e_{m+2}, \ldots, e_k)$.

(ii) Now suppose that the operation on edge $e_{m+1}$ is not the inverse of the operation on
edge $e_m$. Then $T_{m+1} = T_m'$. Hence $P = NNI(T; e_1, \ldots, e_m, e_{m+2}, \ldots, e_k)$.

It follows from (i) and (ii) that $P \cap N^k_{NNI}(T) = \emptyset$.

□

Corollary 4.10. Let $T \in UB(n)$ ($n \geq 4$), and let $e_1, \ldots, e_k$ ($k \geq 2$) be internal edges
of $T$. Suppose that $e_m = e_j$ for some $m, j$, where $1 \leq m < j \leq k$. Let

\[ P = NNI(T; e_1, \ldots, e_m, e_{m+1}, \ldots, e_{j-1}, e_j, \ldots, e_k), \]

\[ Q = NNI(T; e_1, \ldots, e_{m-1}, e_{m+1}, \ldots, e_{j-1}, e_j, \ldots, e_k), \]

\[ R = NNI(T; e_1, \ldots, e_{m-1}, e_{m+1}, \ldots, e_{j-1}, e_{j+1}, \ldots, e_k). \]

Suppose that the edges $e_{m+1}, \ldots, e_{j-1}$ are non-adjacent to $e_m$. If the operation on edge $e_j$ is the
inverse of the operation on edge $e_m$, then $P = R$; otherwise, $P = Q$.

Proof. By Corollary 4.7,

\begin{align*}
P &= NNI(T; e_1, \ldots, e_{m-1}, e_{m+1}, e_m, \ldots, e_{j-1}, e_j, \ldots, e_k) \\
  &= NNI(T; e_1, \ldots, e_{m-1}, e_{m+1}, e_{m+2}, e_m, \ldots, e_{j-1}, e_j, \ldots, e_k) \\
  &= \cdots \\
  &= NNI(T; e_1, \ldots, e_{m-1}, e_{m+1}, \ldots, e_{j-1}, e_m, e_j, \ldots, e_k).
\end{align*}

Therefore, by Lemma 4.9, if the operation on edge $e_j$ is the inverse of the operation on edge $e_m$,
$P = R$; otherwise, $P = Q$.

□

Now we have all of the information required to prove Theorem 4.1. 

Proof of Theorem 4.1. We break down the calculation of the size of the $k^{\text{th}}$ NNI neighbourhood
of $T$ into two steps. First we consider the number of $k^{\text{th}}$ NNI neighbours resulting
from $k$ NNI operations on a given sequence of $k$ edges of $T$. We then consider the number of
ways these $k$ edges can be chosen in $T$. By Lemma 4.8, the number of $k^{\text{th}}$ NNI neighbours of
a given tree $T$ resulting from operations on a given sequence of $k$ edges is not dependent on $n$.
Hence, only the number of ways of choosing these $k$ edges is dependent on $n$. We consider two
cases.

First, assume that the $k$ edges are all distinct, and consider the number of ways they can be
chosen in $T$. By Lemma 3.5 the case where the $k$ edges are pairwise non-adjacent (Case 1 from
the beginning of this subsection) gives a term of order $n^k$ and a term of order $n^{k-1}$. The case
where exactly two of the $k$ edges are adjacent (Case 2) produces a term of order $n^{k-1}$, but not
a term of order $n^k$. If more than two of the $k$ edges are adjacent then the highest order term
is $O(n^{k-2})$.

Now suppose that the $k$ edges are not all distinct. By Lemma 3.5, if $k - 1$ of the $k$ edges
are distinct and pairwise non-adjacent, the highest order term is $O(n^{k-1})$. However, by Corollary 4.10,
the trees produced by this are not $k^{\text{th}}$ NNI neighbours of $T$. By Lemma 3.5, if more
than two of the $k$ edges are the same or if more than two are adjacent, the highest order term
is $O(n^{k-2})$.

In the case where the edges are pairwise non-adjacent, by Lemma 4.8, there are $2^k$ $k^{\text{th}}$ NNI
neighbours of $T$ resulting from NNI operations on a given set of $k$ edges. In the case where exactly
two edges are adjacent there are $2^{k+1}$ resulting $k^{\text{th}}$ NNI neighbours. Hence by Lemma 3.5,






fc(5fc+ 1) fc -1 
2k\ 

3k{k + I) f. j. 


2^ + 


I 


fe—Infe+l 


2k\ 


2^n'^-^ + 


2{k-2)l 


+ 0(n'=-^) 




I i, k{k -I- 2) 


k\ 


n 


2 ^ + 


,k-l 


{k-2)\ 

= —n H-r-;-2 n + U{n ). 


2'=+! -b 0(n'=-2) 


A:! 


fc! 


□

We can see that this result is very similar to the size of the $k^{\text{th}}$ RF neighbourhood, as $D_{T,k}$
and $C_{T,k}$ are both quadratic in $k$.


5 Pairs of Trees with Shared Neighbours. In this section we calculate the number of
pairs of binary phylogenetic trees with $n$ leaves that share a first NNI or RF neighbour.
Equivalently, this is the number of pairs of trees that are within at most distance two of each other.

Our calculation involves summing the size of the first and second neighbourhoods of a tree,
over all binary phylogenetic trees, and discounting any duplicate trees. However, the size of the
second neighbourhood for both NNI and RF is dependent on the number of cherries, by Bryant
and Steel [6] and Robinson [22]. Therefore it is necessary to know the number of binary phylogenetic
trees with $n$ leaves and $c$ cherries, $|UB(n, c)|$. Hendy and Penny [16] found an expression
for $|UB(n, c)|$, which they proved using induction on the number of leaves. Here we present a
constructive proof of their result.


Proposition 5.1. For all $n \geq 4$,

\[ |UB(n, c)| = \frac{n!\,(n-4)!}{c!\,(c-2)!\,(n-2c)!\,2^{2c-2}}, \]

for $2 \leq c \leq \frac{n}{2}$, and $|UB(n, c)| = 0$ otherwise.
Proof. The tree with the smallest number of cherries is a caterpillar, which has two cherries.
Since there are two leaves in a cherry, the maximum number of cherries a tree can have is $\lfloor n/2 \rfloor$.
Hence for $c < 2$ or $c > \frac{n}{2}$ we have $|UB(n, c)| = 0$.

Let $2 \leq c \leq \frac{n}{2}$. Each $T \in UB(n, c)$ has $2c$ leaves that are in cherries. The number of ways
of choosing the $2c$ leaves of $T$ to form the $c$ cherries is $\binom{n}{2c}$. From those $2c$ leaves we choose
two for each cherry. We divide by $c!$ since the ordering of the cherries is not important. (Note
that this is the same as the number of perfect matchings on a complete graph with $2c$ vertices.)
Therefore, the number of ways of choosing $c$ cherries from $n$ leaves is

\[ M = \binom{n}{2c} \frac{(2c)!}{c!\,2^c} = \frac{n!}{c!\,(n-2c)!\,2^c}. \]

Now consider each cherry as a single leaf with the labels of both leaves. There are $c$ of these
double-labelled leaves and $n - 2c$ other leaves. We determine the number of trees that can be
formed with these leaves. We have the restriction that no pair of the $n - 2c$ single-labelled leaves
can be in a cherry. Therefore, we will first consider the number of trees we can form with only
the $c$ double-labelled leaves. This number, $P$, is given in Lemma 2.2,

\[ P = |UB(c)| = \frac{(2c-4)!}{(c-2)!\,2^{c-2}}. \]


Now let $T$ be one of these trees with $c$ double-labelled leaves. We insert the remaining $n - 2c$
single-labelled leaves. Each single-labelled leaf can only be joined to edges in $E(T)$, so as not to
create another cherry. There are $2c - 3$ edges in $E(T)$ to which the single-labelled leaves could
be joined. Since there are no other restrictions on where these single-labelled leaves must be
inserted, we simply need to count the number of distinct trees resulting from joining the $n - 2c$
single-labelled leaves to edges in $E(T)$. This is the product of the number of ways to place $n - 2c$
items into $2c - 3$ bins, and the number of ways to order the $n - 2c$ items. The number of distinct
trees is given by

\[ Q = (n-2c)! \binom{(n-2c) + (2c-3) - 1}{(2c-3) - 1} = (n-2c)! \binom{n-4}{2c-4} = \frac{(n-4)!}{(2c-4)!}. \]

Combining $M$, $P$, and $Q$, we have

\[ |UB(n, c)| = MPQ = \frac{n!}{c!\,(n-2c)!\,2^c} \cdot \frac{(2c-4)!}{(c-2)!\,2^{c-2}} \cdot \frac{(n-4)!}{(2c-4)!} = \frac{n!\,(n-4)!}{c!\,(c-2)!\,(n-2c)!\,2^{2c-2}}. \]

□ 
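The closed form in Proposition 5.1 is easy to evaluate, and one simple check is that summing over all admissible $c$ gives the total number of unrooted binary trees on $n$ leaves, $(2n-5)!!$. The Python sketch below does this for small $n$; the function names `ub_count` and `ub_total` are hypothetical and chosen only for this illustration.

```python
from math import factorial


def ub_count(n, c):
    """|UB(n, c)| from Proposition 5.1: trees with n leaves and c cherries."""
    if n < 4 or c < 2 or 2 * c > n:
        return 0
    numerator = factorial(n) * factorial(n - 4)
    denominator = (factorial(c) * factorial(c - 2)
                   * factorial(n - 2 * c) * 2 ** (2 * c - 2))
    return numerator // denominator


def ub_total(n):
    """(2n - 5)!!, the number of unrooted binary trees on n labelled leaves."""
    total = 1
    for odd in range(3, 2 * n - 4, 2):
        total *= odd
    return total


for n in range(4, 9):
    counts = [ub_count(n, c) for c in range(2, n // 2 + 1)]
    print(n, counts, sum(counts) == ub_total(n))
```

For example, the six-leaf case splits into 90 caterpillars (two cherries) and 15 trees with three cherries, which together account for all 105 trees in $UB(6)$.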

We can now use this result to find the number of pairs of binary phylogenetic trees in $UB(n)$
that are within at most distance two of each other under NNI and RF. For $\theta \in \{NNI, RF\}$,
define

\[ N^k_\theta(n) = \{(T, T') : T, T' \in UB(n),\; d_\theta(T, T') \leq k\}. \]

Corollary 5.2. Let $n > 3$. Then
(i) $|N^2_{NNI}(n)| = \sum_{c=2}^{\lfloor n/2 \rfloor} |UB(n, c)|\,(n^2 - 4n + 2c - 3)$.
(ii) $|N^2_{RF}(n)| = \sum_{c=2}^{\lfloor n/2 \rfloor} |UB(n, c)|\,(n^2 - 3n + 3c - 9)$.

Proof.

(i) For $T \in UB(n, c)$, the number of first and second NNI neighbours is

\[ |N_{NNI}(T)| + |N^2_{NNI}(T)| = 2(n-3) + 2n^2 - 10n + 4c = 2n^2 - 8n + 4c - 6. \]

To find the number of pairs of trees in $UB(n)$ that are within NNI distance two, we
simply sum the number of first and second neighbours over all trees in $UB(n)$, and then
halve the result as each pair will be counted twice. So,

\[ |N^2_{NNI}(n)| = \frac{1}{2} \sum_{c=2}^{\lfloor n/2 \rfloor} |UB(n, c)|\,(2n^2 - 8n + 4c - 6) = \sum_{c=2}^{\lfloor n/2 \rfloor} |UB(n, c)|\,(n^2 - 4n + 2c - 3). \]

Proposition 5.1 gives us $|UB(n, c)|$.

(ii) For each unrooted binary tree $T$, the number of first and second RF neighbours is

\[ |N_{RF}(T)| + |N^2_{RF}(T)| = 2(n-3) + 2n^2 - 8n + 6c - 12 = 2n^2 - 6n + 6c - 18. \]

Therefore

\[ |N^2_{RF}(n)| = \frac{1}{2} \sum_{c=2}^{\lfloor n/2 \rfloor} |UB(n, c)|\,(2n^2 - 6n + 6c - 18) = \sum_{c=2}^{\lfloor n/2 \rfloor} |UB(n, c)|\,(n^2 - 3n + 3c - 9). \]

□
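The two sums in Corollary 5.2 can be evaluated directly from Proposition 5.1. The sketch below is illustrative only; `ub_count` and `pairs_within_two` are hypothetical names introduced here, and the `metric` argument is simply a string switch between the two summands.

```python
from math import factorial


def ub_count(n, c):
    """|UB(n, c)| from Proposition 5.1."""
    if n < 4 or c < 2 or 2 * c > n:
        return 0
    numerator = factorial(n) * factorial(n - 4)
    denominator = (factorial(c) * factorial(c - 2)
                   * factorial(n - 2 * c) * 2 ** (2 * c - 2))
    return numerator // denominator


def pairs_within_two(n, metric):
    """Number of pairs of trees in UB(n) within distance two, as in Corollary 5.2."""
    total = 0
    for c in range(2, n // 2 + 1):
        if metric == "NNI":
            per_tree = n * n - 4 * n + 2 * c - 3   # summand of part (i)
        else:
            per_tree = n * n - 3 * n + 3 * c - 9   # summand of part (ii), RF
        total += ub_count(n, c) * per_tree
    return total


for n in (4, 5, 6):
    print(n, pairs_within_two(n, "NNI"), pairs_within_two(n, "RF"))
```

For $n = 4$ both sums equal 3, the number of pairs among the three trees in $UB(4)$, and for $n = 5$ the RF sum equals 105, which is every pair among the 15 trees in $UB(5)$.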

6 Subtree Prune and Regraft. In this section, we show that unlike RF and NNI, the
size of the second Subtree Prune and Regraft (SPR) neighbourhood of a tree $T \in UB(n)$ is not
uniquely determined by the number of leaves and cherries of $T$.

An SPR operation on a tree $T \in UB(n)$ is defined by the following process:

1. Choose an edge $e = \{u, v\} \in E(T)$ and delete it, leaving two components $T_u$ (containing
the vertex $u$) and $T_v$ (containing the vertex $v$).

2. Choose an edge $f \in E(T_v)$ and subdivide $f$ with a new vertex $w$ to obtain two edges $f_1$
and $f_2$. The vertex $w$ has degree two.

3. Insert the edge $g = \{w, u\}$ and suppress the vertex $v$ to obtain a binary tree $T' \in UB(n)$.
Essentially, we prune the subtree $T_u$ and regraft it onto edge $f$. We refer to $e$ as the cut edge
and $f$ as the join edge of the SPR operation (see Fig. 6.1). The tree $T'$ is a first SPR neighbour
of $T$. We will use the notation $SPR(T, (e, f))$ to refer to the tree obtained by an SPR operation
on tree $T$ with cut edge $e$ and join edge $f$. Note that if $d_T(e, f) = 1$, then $T'$ is a first NNI
neighbour of $T$ [24].
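The three steps of the SPR operation can be sketched in Python and used to enumerate the first SPR neighbourhood of a small tree. This is an illustration under stated assumptions rather than a reference implementation: trees are adjacency sets with integer leaves and string-named internal vertices, trees are compared by their non-trivial splits, the vertex $v$ is suppressed before the join edge is subdivided (which yields the same trees as the order given above), and all function names are hypothetical. The resulting count is compared with the closed form $2(n-3)(2n-7)$ of Allen and Steel [1].

```python
def caterpillar(n):
    """Adjacency sets of a caterpillar on leaves 1..n (internal vertices are strings)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    spine = [f"v{i}" for i in range(n - 2)]
    for a, b in zip(spine, spine[1:]):
        link(a, b)
    link(spine[0], 1)
    link(spine[-1], n)
    for i, s in enumerate(spine):
        link(s, i + 2)
    return adj


def reachable(adj, start, blocked=()):
    """Vertices reachable from start without entering the blocked vertices."""
    seen, stack = {start, *blocked}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen - set(blocked)


def splits(adj):
    """Canonical form of a tree: its set of non-trivial splits."""
    leaves = {v for v in adj if len(adj[v]) == 1}
    ref = min(leaves)
    out = set()
    for u in adj:
        for v in adj[u]:
            if len(adj[u]) == 3 and len(adj[v]) == 3:
                side = reachable(adj, v, blocked=(u,)) & leaves
                out.add(frozenset(leaves - side if ref in side else side))
    return frozenset(out)


def spr_neighbours(adj):
    """Split sets of all trees one SPR operation away from the given tree."""
    found = set()
    for u in adj:
        for v in adj[u]:
            if len(adj[v]) != 3:
                continue                     # v must be internal so that it can be suppressed
            t = {k: set(vs) for k, vs in adj.items()}
            t[u].remove(v); t[v].remove(u)   # step 1: delete the cut edge {u, v}
            p, q = t.pop(v)                  # v now has degree two: suppress it
            t[p].remove(v); t[q].remove(v)
            t[p].add(q); t[q].add(p)
            rest = reachable(t, p)           # the component that does not contain u
            for edge in {frozenset((a, b)) for a in rest for b in t[a]}:
                a, b = tuple(edge)           # step 2: subdivide the join edge {a, b}
                s = {k: set(vs) for k, vs in t.items()}
                s[a].remove(b); s[b].remove(a)
                s[v] = {a, b, u}             # step 3: attach the pruned subtree
                s[a].add(v); s[b].add(v); s[u].add(v)
                found.add(splits(s))
    found.discard(splits(adj))               # regrafting next to the cut gives T back
    return found


for n in (4, 5, 6, 7):
    print(n, len(spr_neighbours(caterpillar(n))), 2 * (n - 3) * (2 * n - 7))
```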



Fig. 6.1. An example of an SPR operation with cut edge e and join edge f.

Consider a graph $G$ in which each vertex represents a tree in $UB(n)$ and there is an edge
between the vertices representing trees $T_1$ and $T_2$ if they are first SPR neighbours. The SPR
distance between $T_1$ and $T_2$, $\delta_{SPR}(T_1, T_2)$, is the distance between the two vertices representing
$T_1$ and $T_2$ in $G$.

For a tree $T \in UB(n)$, the size of the first SPR neighbourhood is given by

\[ |N_{SPR}(T)| = 2(n-3)(2n-7). \]

This was determined by Allen and Steel [1]. No other SPR neighbourhood sizes are currently
known.

In relation to the structure of the SPR neighbourhood, Carceres et al. [8] provided tight
bounds on the length of the shortest NNI walk that visits all trees in the first SPR neighbourhood
of a tree $T$. Allen and Steel [1] found upper and lower bounds for the maximum SPR
distance between any two trees in $UB(n)$.

As with NNI and RF, the size of the first SPR neighbourhood of a tree depends only on
the number of leaves in the tree. However, unlike NNI and RF, the size of the second SPR
neighbourhood of a tree cannot be expressed solely in terms of the number of leaves and cherries
of the tree. In this section we show that these two parameters are not sufficient to determine
even the highest order term of the size of the second SPR neighbourhood. At the end of this
section we prove our main results, which are presented in Theorems 6.1 and 6.2.

Theorem 6.1. Let T G UB{n). 

(i) If T is a caterpillar, then 

\^SPRi'^) \ = 2 ^* + 0{n^). 

(ii) If T is a balanced tree, then 

\^SPRi'^) \ = 2 ^^ + 0{n^). 
It is evident from Theorem 6.1 that the size of the second SPR neighbourhood of a tree $T$
is not uniquely determined by the number of leaves of $T$. However, every caterpillar has exactly
two cherries, whereas a balanced tree with at least six leaves has at least three cherries. Therefore,
for $n \geq 6$, a caterpillar and a balanced tree, each with $n$ leaves, have different numbers of
cherries. Hence Theorem 6.1 alone does not show that the size of the second SPR neighbourhood
of $T$ cannot be uniquely determined by the number of leaves and cherries of $T$. To show this,
we consider two different structures of an unrooted binary tree $T$ with $n = 3m$ ($m \geq 3$) leaves
and three cherries. These two tree structures (Type I and Type II) can be seen in Fig. 6.2 and
Fig. 6.3 respectively. Similar to Theorem 6.1, we show that trees of Type I and Type II also have
a different highest order term in the expression for the size of the second SPR neighbourhood.
This result is presented in Theorem 6.2.



Fig. 6.2. A Type I tree with three cherries and $n = 3m$ leaves ($m \geq 3$).



Fig. 6.3. A Type II tree with three cherries and $n = 3m$ leaves ($m \geq 3$).

Theorem 6.2. Let $T_1$ and $T_2$ be unrooted binary trees with $n = 3m$ leaves ($m \geq 3$) and
three cherries, and suppose that $T_1$ is of Type I and $T_2$ is of Type II. Then

I^Ipr('7"i)I = + 0 {n^), and 


OQ 

\NlpR{T2)\ = -n^ + 0{n^). 


We will use the notation

\[ SPR(T, (c_1, j_1), (c_2, j_2), \ldots, (c_k, j_k)) \]

to denote the tree obtained by $k$ successive SPR operations starting with tree $T$, where $c_1$ and $j_1$
in $T$ are the cut and join edges, respectively, of the first operation, $c_2$ and $j_2$ in $SPR(T, (c_1, j_1))$
are the cut and join edges of the second operation, and so on. When $k = 2$, we refer to the two
operations that result in the tree $SPR(T, (c_1, j_1), (c_2, j_2))$ as a pair of SPR operations.
It is worth noting that some of these cut and join edges may not be edges of $T$ if they are
created by one of the SPR operations. However, the results in this section will require only sets
of 'well-separated' edges, where the cut and join edges are pairwise at least distance three apart,
so that all of the edges $c_1, \ldots, c_k$ and $j_1, \ldots, j_k$ are edges of $T$.

First, we determine an upper bound on the size of the second SPR neighbourhood. This
follows directly from the expression for the size of the first SPR neighbourhood given by Allen
and Steel [1].

Corollary 6.3. Let $T \in UB(n)$ ($n \geq 3$). Then

\[ |N^2_{SPR}(T)| \leq 4(n-3)^2(2n-7)^2 = O(n^4). \]


The first step in proving Theorems 6.1 and 6.2 is to determine whether or not all pairs of 
SPR operations contribnte to the term of order in the expression for the size of the second 
SPR neighbourhood of a tree. 

Let $T \in UB(n)$ and let

\[ \mathbb{T}(T) = \{(c_1, c_2, j_1, j_2) : c_1, j_1 \in E(T),\ c_1 \neq j_1;\ c_2, j_2 \in E(SPR(T, (c_1, j_1))),\ c_2 \neq j_2\}. \]

This is the set of all possible choices for the four cut and join edges of two SPR operations
starting with tree $T$.

Let $\mathbb{S}(T)$ be the subset of $\mathbb{T}(T)$ where $c_2, j_2 \in E(T)$ and the four edges $c_1, j_1, c_2, j_2$ are
pairwise at least distance three apart in $T$.

The following lemma shows that in order to prove Theorems 6.1 and 6.2, it suffices to consider
only pairs of SPR operations with cut and join edges in $\mathbb{S}(T)$.

Lemma 6.4. Let $T \in UB(n)$. Then

\[ |\mathbb{S}(T)| = \tfrac{2}{3} n^4 + O(n^3) \quad \text{and} \quad |\mathbb{T}(T) - \mathbb{S}(T)| = O(n^3). \]


Proof. For sufficiently large values of $n$, it is possible to choose the edges $c_1$, $j_1$, $c_2$ and $j_2$ in
$T$ such that $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$. To determine the size of $\mathbb{S}(T)$, we count the number of sets
of four internal edges of $T$, where all pairs of edges in the set are at least distance three apart.
There are $2n - 3$ choices for edge $c_1$, since this is the number of edges in $T$ (this follows from
Lemma 2.2). The maximum number of choices for $j_1$ is $2n - 3 - 7$ (this can occur if $c_1$ is a
pendant edge). The minimum number of choices for edge $j_1$ is $2n - 3 - 29$ (this can occur if $c_1$
is an internal edge). The maximum number of choices for $c_2$ is $2n - 3 - 7 - 6$ (this can occur
if $c_1$ and $j_1$ are both pendant edges). The minimum number of choices for $c_2$ is $2n - 3 - 2(29)$
(this can occur if both $c_1$ and $j_1$ are internal edges). A similar process determines upper and
lower bounds on the number of choices for edge $j_2$. We divide by the number of ways to order
the four edges. Therefore

\[ |\mathbb{S}(T)| \geq \tfrac{1}{24}(2n - 3)(2n - 3 - 29)(2n - 3 - 2(29))(2n - 3 - 3(29)) = \tfrac{2}{3} n^4 + O(n^3), \quad \text{and} \]

\[ |\mathbb{S}(T)| \leq \tfrac{1}{24}(2n - 3)(2n - 3 - 7)(2n - 3 - 7 - 6)(2n - 3 - 7 - 2(6)) = \tfrac{2}{3} n^4 + O(n^3). \]

We now consider $\mathbb{T}(T) - \mathbb{S}(T)$. Determining $|\mathbb{T}(T) - \mathbb{S}(T)|$ is similar to determining $|\mathbb{S}(T)|$;
however, for at least one of the four cut and join edges, instead of counting the number of edges
at least distance three from those already chosen, we count the number within distance two of
those already chosen, and therefore obtain a constant factor instead of a linear factor. Let $M$ be
a maximal subset of the edges $\{c_1, c_2, j_1, j_2\}$ such that the edges in $M$ are pairwise distance
at least three apart in $T$, where $|M| = m < 4$. Suppose we first choose the edges in $M$. From the
argument above we can see that the number of such choices is $O(n^m)$. The remaining $4 - m \geq 1$
edges must be chosen from edges within distance two of those already chosen. The number of
these choices depends only on the number and location of the $m$ edges already chosen, and not
on $n$. Hence

\[ |\mathbb{S}(T)| = \tfrac{2}{3} n^4 + O(n^3), \quad \text{and} \quad |\mathbb{T}(T) - \mathbb{S}(T)| = O(n^3). \]

□

Lemma 6.4 tells us that the highest order term in the expression for the size of $\mathbb{S}(T)$ is $O(n^4)$.
Note that instead of requiring the edges in $\mathbb{S}(T)$ to be at least distance three apart, we could
have made them distance $k$ apart for any $k \in \mathbb{Z}^+$ and Lemma 6.4 would still hold. We have
chosen to consider distance three, because if pairs of these four edges are within distance two
of each other, then there are more cases to consider in order to determine exactly when two
different pairs of SPR operations produce the same tree. To determine only the $O(n^4)$ term in
the expression for the size of the second SPR neighbourhood, we can ignore all cases where there
exist edges $e, f \in \{c_1, c_2, j_1, j_2\}$ such that $d_T(e, f) \leq 2$.

However, we cannot simply take the highest order term in the expression for the size of S(T) 
as the highest order term in the expression for the size of the second SPR neighbourhood of a 
tree T. This is because there may be cases where two different pairs of SPR operations produce 
the same tree (duplicates), or when a pair of SPR operations produces a first SPR neighbour of T. 
To prove Theorem 6.1 and Theorem 6.2 we need to know precisely when these two situations arise. 

In Lemma 6.6 and Lemma 6.7 we show that a pair of SPR operations with cut and join edges
in $\mathbb{S}(T)$ never yields a first SPR neighbour. We also consider when it is possible for two pairs of
SPR operations to produce the same tree. We consider different cases for the relative locations
of the four cut and join edges. There are three cases to consider:

1. The edges $j_2, c_1, c_2, j_1$ lie on a path in $T$ in this order.

2. The edges $c_2, c_1, j_2, j_1$ lie on a path in $T$ in this order.

3. If the four cut and join edges lie on a path in $T$, then they are in an order other than
those given in Cases (1) and (2).

Note that Cases (1) and (2) are characterised by the cut edge $c_1$, of the first operation, being
on the path between the cut and join edges $c_2$ and $j_2$, of the second operation. In both of these
cases, the first operation changes which internal vertices form the endpoints of the $(c_1, j_1)$-path.
In Case (3), the first operation does not change these endpoints. In Lemma 6.7, we see that in
Cases (2) and (3),

\[ SPR(T, (c_1, j_1), (c_2, j_2)) = SPR(T, (c_2, j_2), (c_1, j_1)). \]

However, in Lemma 6.6, we show that in Case (1),

\[ SPR(T, (c_1, j_1), (c_2, j_2)) \neq SPR(T, (c_2, j_2), (c_1, j_1)). \]

For all three cases, we see that no other pair of SPR operations can yield the tree
$SPR(T, (c_1, j_1), (c_2, j_2))$.


First, we require a result about how SPR operations on a tree $T \in UB(n)$ with subtrees
$A$ and $B$ can result in a tree $T'$ where $d_{T'}(A, B) < d_T(A, B)$. Recall that an internal subtree
always has at least two vertices of degree two. Therefore, an internal subtree must have at least
one internal edge.

Lemma 6.5. Let $T \in UB(n)$. Suppose there exist subtrees $A$ and $B$ of $T$ (not necessarily
pendant or maximal), such that $d_T(A, B) = k$. Let $a$ and $b$ be vertices of degree two in $A$ and
$B$ respectively, such that $d_T(a, b) = k$. Call the two pendant edges of the $(a - b)$-path $P$, $e$
and $f$ respectively. Let $T' \in UB(n)$ and suppose that $A$ and $B$ are subtrees of $T'$ such that
$d_{T'}(A, B) = 2$. Let $c_1$, $j_1$, $c_2$, and $j_2$ be edges of $T$.

(i) For $k \geq 4$, if $T' = SPR(T, (c_1, j_1))$, then $c_1 \in \{e, f\}$ and $j_1$ is incident to $A$ or $B$.

(ii) For $k \geq 5$, if $T' = SPR(T, (c_1, j_1), (c_2, j_2))$ with $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$, then either
$c_1 \in \{e, f\}$ and $j_1$ is incident to $A$ or $B$, or $c_2 \in \{e, f\}$ and $j_2$ is incident to $A$ or $B$.

(iii) For $k \geq 4$, if $d_{T'}(a, b) = 2$ and $T' = SPR(T, (c_1, j_1))$, then $\{c_1, j_1\} = \{e, f\}$.

(iv) For $k \geq 5$, if $d_{T'}(a, b) = 2$ and $T' = SPR(T, (c_1, j_1), (c_2, j_2))$ with $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$,
then $\{c_1, j_1\} = \{e, f\}$ or $\{c_2, j_2\} = \{e, f\}$.

Proof. Fig. 6.4 shows trees $T$ and $T'$.

Fig. 6.4. Trees $T$ and $T'$ from Lemma 6.5. Here we have $T'$ with $d_{T'}(a, b) = 2$.

(i) Let $T'' = SPR(T, (c_1, j_1))$. We show that $T'' \neq T'$ unless $c_1 \in \{e, f\}$ and $j_1$ is incident
to $A$ or $B$. First suppose that edge $c_1$ is in subtree $A$ or $B$. Then either $T'' = T$ (if we
regraft in the same place), or $T'' \neq T'$ since the subtree ($A$ or $B$) is not in $T''$. Likewise
for edge $j_1$.

Now suppose that $c_1$ and $j_1$ are not edges of $A$ or $B$. Hence $A$ and $B$ are both subtrees
of $T''$. If we assume that edge $c_1$ is not in $P$ or incident to $P$, then $d_{T''}(A, B) \geq 5$ if
$j_1$ is an edge of $P$; otherwise, $d_{T''}(A, B) \geq 4$. Therefore $T'' \neq T'$. If $c_1$ is incident to
$P$ then deleting edge $c_1$ creates a vertex of degree two in $P$, which is suppressed by the
SPR operation. Hence $d_{T''}(A, B) \geq 4$ if $j_1$ is an edge of $P$; otherwise, $d_{T''}(A, B) \geq 3$.
Therefore $T'' \neq T'$.

Now suppose that $j_1$ is not an edge of $A$ or $B$, and $c_1$ is an edge of $P$. If $c_1 \notin \{e, f\}$ then
$d_{T''}(A, B) \geq 3$ if $j_1$ is incident to $A$ or $B$; otherwise, $d_{T''}(A, B) \geq 4$. Therefore $T'' \neq T'$.
Finally suppose that $c_1 \in \{e, f\}$, and $j_1$ is not incident to $A$ or $B$. Then $d_{T''}(A, B) \geq 3$
and $T'' \neq T'$. The only remaining possibility is that $c_1 \in \{e, f\}$ and $j_1$ is incident to
either $A$ or $B$.

(ii) Let $T'' = SPR(T, (c_1, j_1), (c_2, j_2))$, where $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$. We show that $T'' \neq T'$
unless either $c_1 \in \{e, f\}$ and $j_1$ is incident to $A$ or $B$, or $c_2 \in \{e, f\}$ and $j_2$ is incident to
$A$ or $B$. As in (i), if any of the four cut and join edges are in the subtrees $A$ or $B$ in $T$,
then that subtree is not a subtree of $T''$, so $T'' \neq T'$. In (i) we saw that if the cut edge
of an operation is not in $P$ or incident to $P$ in $T$, then the operation does not reduce
the distance between $A$ and $B$. As in (i), an operation with a cut edge incident to $P$
reduces the distance between $A$ and $B$ by at most one. Hence if neither cut edge $c_1$ nor
$c_2$ is in $P$, we have $d_{T''}(A, B) \geq 3$, and so $T'' \neq T'$.

Now we assume that at least one of the edges $c_1$ and $c_2$ is an edge of $P$. Suppose that
$c_1$ is not an edge of $P$, but $c_2$ is. Then if $T_1 = SPR(T, (c_1, j_1))$, $d_{T_1}(A, B) \geq 4$. By (i),
if $T'' = T'$ then $c_2 \in \{e, f\}$ and $j_2$ is incident to either $A$ or $B$.

Now suppose that $c_1 \notin \{e, f\}$ is an edge of $P$. Then as in (i), $d_{T_1}(A, B) \geq 3$ if $j_1$ is
incident to $A$ or $B$, and $d_{T_1}(A, B) \geq 4$ otherwise. If $d_{T_1}(A, B) \geq 4$ then by (i), unless
$c_2 \in \{e, f\}$ and $j_2$ is incident to $A$ or $B$, the second operation cannot result in $T'$.
If $d_{T_1}(A, B) = 3$, then since $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$, the edges $c_2$ and $j_2$ cannot be in or
incident to the shortest path between $A$ and $B$ in $T_1$. Hence $d_{T''}(A, B) = 3$, and $T'' \neq T'$.

Finally, suppose that $c_1 \in \{e, f\}$. If $j_1$ is not incident to $A$ or $B$ then in the tree
$T_1 = SPR(T, (c_1, j_1))$ we have $d_{T_1}(A, B) \geq 3$. Again, if $d_{T_1}(A, B) \geq 4$ then by (i), the
second operation cannot result in $T'$ unless $c_2 \in \{e, f\}$ and $j_2$ is incident to $A$ or $B$. If
$d_{T_1}(A, B) = 3$ then since $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$, the edges $c_2$ and $j_2$ cannot be in or
incident to the shortest path between $A$ and $B$ in $T_1$. Hence $d_{T''}(A, B) = 3$, and $T'' \neq T'$.
The only remaining possibility is that $j_1$ is incident to either $A$ or $B$.

(iii) Now we have $d_{T'}(a, b) = 2$. Let $T'' = SPR(T, (c_1, j_1))$. By (i), if $T'' = T'$ then
$c_1 \in \{e, f\}$ and $j_1$ is incident to either $A$ or $B$. If $j_1 \notin \{e, f\}$ (which can occur if $A$ or $B$
is an internal subtree) then $d_{T''}(A, B) = 2$ but $d_{T''}(a, b) > 2$, since the internal subtree
has at least one internal edge. Hence if $T'' = T'$ then $\{c_1, j_1\} = \{e, f\}$.

(iv) Let $T'' = SPR(T, (c_1, j_1), (c_2, j_2))$, where $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$. By (ii), if $T'' = T'$, then
either $c_1 \in \{e, f\}$ and $j_1$ is incident to $A$ or $B$, or $c_2 \in \{e, f\}$ and $j_2$ is incident to $A$ or $B$.

Suppose that $c_1 \in \{e, f\}$ and $j_1$ is incident to $A$ or $B$. Let $T_1 = SPR(T, (c_1, j_1))$. If
$j_1 \notin \{e, f\}$, then $d_{T_1}(A, B) = 2$, but $d_{T_1}(a, b) > 2$. This is because $a$ and $b$ are not the
endpoints of the shortest path between $A$ and $B$ in $T_1$. Now, $c_2$ and $j_2$ cannot be edges
in $A$ or $B$, and $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$. Therefore, $c_2$ and $j_2$ cannot be edges on or incident
to the path between $a$ and $b$ in $T_1$. Therefore $d_{T''}(a, b) > 2$, and $T'' \neq T'$. Hence, we
must have $\{c_1, j_1\} = \{e, f\}$.

Now suppose that $c_2 \in \{e, f\}$, and $j_2$ is incident to $A$ or $B$. If $c_1$ is not an edge of $P$,
then as in (i), $d_{T_1}(A, B) \geq 4$. Therefore, by (iii), if $T'' = T'$ then $\{c_2, j_2\} = \{e, f\}$. Now
suppose that $c_1$ is an edge of $P$. If $a$ and $b$ are the endpoints of the shortest path between
$A$ and $B$ in $T_1$, then $d_{T_1}(A, B) \geq 6$, and so by (iii), if $T'' = T'$ then $\{c_2, j_2\} = \{e, f\}$.
Now assume that $a$ and $b$ are not the endpoints of the shortest path $P'$ between $A$ and $B$
in $T_1$. Then exactly one of the edges $e$ or $f$ is an edge of $P'$. Without loss of generality,
suppose that $e$ is an edge in $P'$. Then if $T'' = T'$, $c_2 = e$ and $j_2$ is incident to $B$. If
$j_2 \neq f$, then we have $d_{T''}(A, B) = 2$, but $d_{T''}(a, b) > 2$. Therefore $T'' \neq T'$. Hence if
$T'' = T'$, then $\{c_2, j_2\} = \{e, f\}$.

□

Lemma 6.6. Let $T \in UB(n)$, and suppose that we have trees $T' = SPR(T, (c_1, j_1))$ and
$T'' = SPR(T, (c_1, j_1), (c_2, j_2))$ where $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$. Suppose that the edges $j_2$, $c_1$, $c_2$, and
$j_1$ lie on a path in $T$ in this order. Then

(i) $T'' \notin N_{SPR}(T)$, and

(ii) for all other choices of edges $(c_1', c_2', j_1', j_2') \in \mathbb{S}(T)$ where $(c_1', c_2', j_1', j_2') \neq (c_1, c_2, j_1, j_2)$,
we have

\[ T'' \neq SPR(T, (c_1', j_1'), (c_2', j_2')). \]

Proof. Since the four cut and join edges lie on a path in $T$, the rest of the tree can be
partitioned into five subtrees (two pendant and three internal) connected by these four edges.

Consider the forest $T \setminus \{c_1, j_1, c_2, j_2\}$. It has components $A$, $B$, $C$, $D$, and $E$ which are
subtrees of $T$. Edge $j_2$ is incident to $A$ and $B$, edge $c_1$ is incident to $B$ and $C$, edge $c_2$ is incident
to $C$ and $D$, and edge $j_1$ is incident to $D$ and $E$. Fig. 6.5 shows $T$, $T'$ and $T''$. Each of the
internal subtrees $B$, $C$ and $D$ has at least three internal edges, as all pairs of the four cut and
join edges are at least distance three apart. Let $b$ be the endpoint of $c_1$ that is in $B$, and $c$ be
the endpoint of $c_2$ that is in $C$.

(i) From the above, we have $d_T(B, E) = d_T(b, E) \geq 9$ and $d_{T''}(B, E) = d_{T''}(b, E) = 2$.
Therefore if $T''$ is a first SPR neighbour of $T$, then either $T'' = SPR(T, (c_1, j_1)) = T'$ or
$T'' = SPR(T, (j_1, c_1))$ by Lemma 6.5 (iii). We have $d_{T''}(A, C) = 2$ and $d_{T'}(A, C) \geq 10$,
so $T'' \neq T'$. In $T_1 = SPR(T, (j_1, c_1))$, $d_{T_1}(A, C) \geq 6$, so $T'' \neq T_1$. Therefore $T''$ is not
a first SPR neighbour of $T$.


Note that Lemma 6.5 applies when $d_T(B, E) = d_T(b, x) \geq 9$ and $d_{T''}(B, E) = d_{T''}(b, x) = 2$ where $x$ is a
vertex of degree two in $E$. However, since $E$ is a pendant subtree with only one vertex of degree two, we simply
use $d_T(b, E)$ instead of $d_T(b, x)$ for simplicity. This occurs in other places throughout the proofs of Lemma 6.6
and Lemma 6.7.

Fig. 6.5. Trees $T' = SPR(T, (c_1, j_1))$ and $T'' = SPR(T, (c_1, j_1), (c_2, j_2))$.


(ii) Considering $(c_1', c_2', j_1', j_2') \in \mathbb{S}(T)$, suppose that we have $T_1 = SPR(T, (c_1', j_1'))$ and
$T_2 = SPR(T, (c_1', j_1'), (c_2', j_2'))$. We show that if $T_2 = T''$, then $(c_1', c_2', j_1', j_2') = (c_1, c_2, j_1, j_2)$.

As before, we have $d_{T''}(B, E) = d_{T''}(b, E) = 2$. Since $d_T(B, E) = d_T(b, E) \geq 9$, if
$T'' = T_2$, then the cut and join edges for one of the operations must be $c_1$ and $j_1$ by
Lemma 6.5 (iv). There are four cases to consider.

(a) First suppose that $(c_1', j_1') = (c_1, j_1)$. Then $T_1 = T'$. Since all SPR operations
on $T'$ result in distinct neighbours (see [1]), $T'' = SPR(T', (c_2', j_2'))$ only if
$(c_2', j_2') = (c_2, j_2)$.

(b) Now suppose that $(c_1', j_1') = (j_1, c_1)$. Then $d_{T_1}(B, C) = d_{T_1}(B, E) = d_{T_1}(C, E) = 2$.
The edges $c_2'$ and $j_2'$ must be distance three or more from $c_1$ and $j_1$ in $T$. If $c_2'$
is in one of the subtrees $B$, $C$ and $E$, then this subtree is not a subtree of $T_2$,
and $T_2 \neq T''$. Similarly, if $j_2'$ is in one of these three subtrees, then $T_2 \neq T''$.
If neither $c_2'$ nor $j_2'$ is in one of the subtrees $B$, $E$ or $C$, then we know that
$d_{T_2}(B, C) = d_{T_2}(B, E) = d_{T_2}(C, E) = 2$. However, $d_{T''}(C, E) \geq 7$ so $T_2 \neq T''$.

(c) We now assume that $\{c_2', j_2'\} = \{c_1, j_1\}$. Therefore $c_1', j_1' \notin \{c_1, j_1\}$. We have
$d_T(A, C) \geq 5$ and $d_{T''}(A, C) = 2$. If $T_2 = T''$, then by Lemma 6.5 (ii), we have
$(c_1', j_1') = (j_2, c_2)$. Therefore, $d_{T_1}(A, C) = d_{T_1}(A, D) = 2$. Regardless of whether the
second SPR operation involves pruning $B$ or $E$ in $T_1$, $d_{T_2}(A, C) = d_{T_2}(A, D) = 2$.
However, $d_{T''}(A, D) \geq 7$, so $T_2 \neq T''$.

Therefore, $T_2 = T''$ implies that $(c_1', c_2', j_1', j_2') = (c_1, c_2, j_1, j_2)$.

□

Lemma 6.7. Let $T \in UB(n)$ and suppose that we have trees $T' = SPR(T, (c_1, j_1))$ and
$T'' = SPR(T, (c_1, j_1), (c_2, j_2))$ where $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$. Suppose that there is no path in $T$ in
which the edges $j_2$, $c_1$, $c_2$, and $j_1$ lie in this order. Then

(i) $T'' \notin N_{SPR}(T)$, and

(ii) for all choices of edges $(c_1', c_2', j_1', j_2') \in \mathbb{S}(T)$, $(c_1', c_2', j_1', j_2') \neq (c_1, c_2, j_1, j_2)$, we have

\[ T'' = SPR(T, (c_1', j_1'), (c_2', j_2')) \]

if and only if $(c_1', c_2', j_1', j_2') = (c_2, c_1, j_2, j_1)$.

Proof. Let the endpoints of the $(c_1 - j_1)$-path in $T$ be $c$ and $d$ respectively. Let the subtrees
at the endpoints of the $(c_1 - j_1)$-path be $C_1$ and $D_1$ respectively. Then $d_{T'}(C_1, D_1) = 2$. Now let
$C$ and $D$ be subtrees of $C_1$ and $D_1$ respectively for which $d_{T''}(C, D) = 2$. Because neither $c_2$ nor
$j_2$ is within distance two of $c_1$ or $j_1$, $C$ and $D$ each have at least three internal edges. Therefore,
$C$ and $D$ are subtrees such that $d_T(C, D) = d_T(c, d) \geq 5$ and $d_{T''}(C, D) = d_{T''}(c, d) = 2$ (note
that $C$ and $D$ may be internal subtrees).

Suppose that the edges $c_2, c_1, j_2, j_1$ do not lie on a path in $T$ in this order (Case (3) from
before the statement of Lemma 6.5). Let the endpoints of the $(c_2 - j_2)$-path in $T$ be $a$ and $b$
respectively. Let the subtrees at the endpoints of the $(c_2 - j_2)$-path be $A_1$ and $B_1$ respectively.
Let the subtrees at the endpoints of the $(c_2 - j_2)$-path in $T'$ be $A_2$ and $B_2$ respectively. Note
that $a$ and $b$ are also the endpoints of this path in $T'$. Now let $A = A_1 \cap A_2$ and $B = B_1 \cap B_2$.
Since $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$, $A$ and $B$ have at least three internal edges. Note that $A$ and $B$
are subtrees at the endpoints of the $(c_2 - j_2)$-path in $T$ respectively ($A$ and $B$ may be internal
subtrees of $T$). We have $d_T(A, B) = d_T(a, b) \geq 5$. Since $c_1$ cannot be within distance two of
either $c_2$ or $j_2$, $d_{T'}(A, B) = d_{T'}(a, b) \geq 5$. Finally, $d_{T''}(A, B) = d_{T''}(a, b) = 2$.

Now suppose that the edges $c_2, c_1, j_2, j_1$ lie on a path in $T$ in this order (Case (2) from before
the statement of Lemma 6.5). Let $A$ be the pendant subtree of $T$ incident to $c_2$, and let $B$ be
the internal subtree of $T$ incident to both $j_1$ and $j_2$. Let the endpoints of the $(c_2 - j_2)$-path in $T$
be $a$ and $b$ respectively. Then in $T$, $A$ and $B$ are subtrees at the endpoints of the $(c_2 - j_2)$-path
respectively. Figure 6.6 shows subtrees $A$, $B$, $C$ and $D$. Note that $d_T(A, B) = d_T(a, b) \geq 9$, and
$d_{T''}(A, B) = d_{T''}(a, b) = 2$. The difference between this case and Case (3) is that here, $a$ and $b$
are not the endpoints of the shortest path between $A$ and $B$ in $T'$, as they are in Case (3).

(i) From above, we have $d_T(C, D) = d_T(c, d) \geq 5$, but $d_{T''}(C, D) = d_{T''}(c, d) = 2$. If
$T'' \in N_{SPR}(T)$, we either have $T'' = SPR(T, (c_1, j_1)) = T'$ or $T'' = SPR(T, (j_1, c_1))$
by Lemma 6.5 (iii). Now $d_{T''}(A, B) = 2$, while $d_{T'}(A, B) \geq 5$, and so $T'' \neq T'$. In
$T_1 = SPR(T, (j_1, c_1))$, $d_{T_1}(A, B) \geq 5$ because $c_1$ cannot be within distance two of either
$c_2$ or $j_2$. Hence $T'' \neq T_1$.

(ii) As in Lemma 6.6, if

\[ T'' = SPR(T, (c_1', j_1'), (c_2', j_2')) \]

where $(c_1', c_2', j_1', j_2') \in \mathbb{S}(T)$, and $(c_1', c_2', j_1', j_2') \neq (c_1, c_2, j_1, j_2)$, then by Lemma 6.5 (iv),
we have $\{\{c_1', j_1'\}, \{c_2', j_2'\}\} = \{\{c_1, j_1\}, \{c_2, j_2\}\}$. We consider all possible cases. Let
$T_1 = SPR(T, (c_1', j_1'))$ and $T_2 = SPR(T, (c_1', j_1'), (c_2', j_2'))$.

(a) First let $(c_1', j_1') = (c_2, j_2)$ and $(c_2', j_2') = (c_1, j_1)$. In Case (3) the first SPR operation
on $T$ prunes and regrafts $A_1$ so that $d_{T_1}(A_1, B_1) = 2$. In Case (2) the first SPR
operation on $T$ prunes and regrafts $A$ so that $d_{T_1}(A, B) = 2$. In both cases, the
endpoints of the $(c_1 - j_1)$-path in $T_1$ are $c$ and $d$. Hence $d_{T_2}(A, B) = d_{T_2}(a, b) =
d_{T_2}(C, D) = d_{T_2}(c, d) = 2$ and case analysis shows that $T_2 = T''$. So

\[ T'' = SPR(T, (c_1, j_1), (c_2, j_2)) = SPR(T, (c_2, j_2), (c_1, j_1)). \]

Fig. 6.6. Trees $T$, $T' = SPR(T, (c_1, j_1))$ and $T'' = SPR(T, (c_1, j_1), (c_2, j_2))$ where the edges $c_2, c_1, j_2, j_1$ lie
on a path in $T$ in this order.

(b) Now consider the case where $(c_1', j_1') = (c_1, j_1)$. Then $T_1 = T'$. Since we know
that SPR operations on $T$ with different cut and join edges result in distinct trees
(see [1]), we have

\[ SPR(T, (c_1, j_1), (j_2, c_2)) \neq SPR(T, (c_1, j_1), (c_2, j_2)) = T''. \]

Similarly,

\[ SPR(T, (c_2, j_2), (j_1, c_1)) \neq SPR(T, (c_2, j_2), (c_1, j_1)) = T''. \]

(c) Let $X$ be the subtree of $T$ such that $d_T(X, D) = 2$ and $X$ does not contain edge $c_1$.
Then $d_{T'}(C, D) = 2$ and $d_{T'}(C, X) = d_{T'}(D, X) = 3$. Since the cut and join edges
for the second SPR operation must be at least distance three from $c_1$ and $j_1$ in
$T$, there is a subtree of $X$, which we denote $X'$, such that $X'$ contains the root
of $X$, and $d_{T''}(C, X') = d_{T''}(D, X') = 3$. Suppose that $(c_1', j_1') = (j_1, c_1)$. Then
we know that $d_{T_1}(C, X) = d_{T_1}(D, X) \geq 6$. Again, there exists a subtree $X''$ of $X$
such that $X''$ contains the root of $X$, and $d_{T_2}(C, X'') = d_{T_2}(D, X'') \geq 6$. Since
$(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$ and $(c_1', c_2', j_1', j_2') \in \mathbb{S}(T)$, the intersection $\mathring{X}$ between $X'$ and
$X''$ is non-empty, and $\mathring{X}$ must have at least one leaf of $T$. Therefore $T_2 \neq T''$. The
same argument applies with subtrees $A$ and $B$ if we consider $(c_1', j_1') = (j_2, c_2)$.

















Therefore

\[ T'' = SPR(T, (c_1, j_1), (c_2, j_2)) = SPR(T, (c_2, j_2), (c_1, j_1)), \]

but for all other choices of edges $(c_1', c_2', j_1', j_2') \in \mathbb{S}(T)$, $(c_1', c_2', j_1', j_2') \neq (c_1, c_2, j_1, j_2)$, we
have $T'' \neq SPR(T, (c_1', j_1'), (c_2', j_2'))$.


□ 

We have now established that every pair of SPR operations on a tree $T \in UB(n)$ with cut and
join edges in $\mathbb{S}(T)$ produces a second SPR neighbour of $T$ (not a first SPR neighbour). The only
case where two different pairs of SPR operations produce the same second neighbour arises when
there is no path in $T$ with the edges $j_2, c_1, c_2, j_1$ in the order listed, and

\[ SPR(T, (c_1, j_1), (c_2, j_2)) = SPR(T, (c_2, j_2), (c_1, j_1)). \]

We now count the number of ways the edges $j_2$, $c_1$, $c_2$ and $j_1$ can appear in a path in a binary
tree $T$ in the order given, where the four edges are pairwise at least distance three apart (i.e.
they are in $\mathbb{S}(T)$). Let this quantity be $P(T)$. In order to calculate $P(T)$, we need to determine
the number of paths of all lengths greater than or equal to 13 in $T$. This quantity is not uniquely
determined by the number of leaves and cherries of $T$ (Theorem A.2). However, if $T$ is known
to be a caterpillar or a balanced tree, we can determine the number of paths in $T$ of any given
length, using the number of leaves of $T$. In Lemma 6.8 we find expressions for the number of
paths of a given length in a caterpillar and a balanced tree, which we will use to prove Theorem 6.1.


Lemma 6.8. For $n \geq 4$:

(i) A caterpillar with $n$ leaves has $4(n - k)$ paths of length $k$ for $3 \leq k \leq n - 1$.

(ii) Let

\[ f(k) = \begin{cases} 3 \cdot 2^{k/2 - 1} \left(n - 2^{k/2}\right), & k \text{ even}; \\ 2^{(k+1)/2} \left(n - 3 \cdot 2^{(k-3)/2}\right), & k \text{ odd}. \end{cases} \]

A balanced tree with $n = 2^i$ leaves ($i \geq 2$) has $f(k)$ paths of length $k$ for $3 \leq k \leq 2i - 1$,
and a balanced tree with $n = 3 \cdot 2^i$ leaves ($i \geq 1$) has $f(k)$ paths of length $k$ for
$3 \leq k \leq 2(i + 1)$.


Proof.

(i) A caterpillar $T$ has a single path of $n - 3$ internal edges. Now $p_{k-2}(T)$ is the number of
ways to choose $k - 2$ of these internal edges so that they are adjacent. This is given by
$p_{k-2}(T) = (n - 3) - (k - 2) + 1 = n - k$. Then by Lemma 2.1, $P_k(T) = 4p_{k-2}(T) = 4(n - k)$,
for $k \geq 3$.

(ii) If $T$ is a balanced tree with $n$ leaves, then it has $c = n/2$ cherries. Let $P_k(n)$ be the number
of paths of length $k$ in a balanced tree with $n$ leaves, and let $p_k(n)$ be the number of
internal paths of length $k$ in a balanced tree with $n$ leaves. The number of internal paths
of length $k$ in $T$ is given by the number of paths of length $k$ in $T'$ where $T'$ is the subtree
induced by the internal vertices of $T$. Since $T'$ has $n/2$ leaves,

\[ p_k(n) = P_k\left(\tfrac{n}{2}\right), \]

provided $n \geq 6$. From Lemma 2.1,

\[ P_k(n) = 4p_{k-2}(n). \]

We have $p_2(n) = n + c - 6 = 3\left(\tfrac{n}{2} - 2\right)$ by Theorem A.2, so if $k$ is even then

\[ P_k(n) = 3 \cdot 2^{k-2} \left( \frac{n}{2^{k/2 - 1}} - 2 \right) = 3 \cdot 2^{k/2 - 1} \left(n - 2^{k/2}\right). \]

We have $p_1(n) = n - 3$ by Lemma 2.2, so if $k$ is odd then

\[ P_k(n) = 2^{k-1} \left( \frac{n}{2^{(k-3)/2}} - 3 \right) = 2^{(k+1)/2} \left(n - 3 \cdot 2^{(k-3)/2}\right). \]

Now if $n = 2^i$, the maximum path length in the tree is given by $2i - 1$, and if $n = 3 \cdot 2^i$ then the
maximum path length in the tree is given by $2(i + 1)$.

□
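The counts in Lemma 6.8 can be checked by brute force on small trees. The sketch below is only an illustration: it assumes that a "path of length $k$" means an unordered pair of vertices joined by a path with exactly $k$ edges, and that a balanced tree is built from two (or three) complete rooted binary trees of equal depth. The helper names (`caterpillar`, `balanced`, `paths_of_length`, `f`) are hypothetical and do not come from the paper.

```python
from collections import deque
from itertools import count


def add_edge(adj, u, v):
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)


def caterpillar(n):
    """Caterpillar with n >= 4 leaves: a spine of n - 2 internal vertices."""
    adj = {}
    for i in range(n - 3):
        add_edge(adj, ("s", i), ("s", i + 1))
    for i in range(n - 2):
        # the two end vertices of the spine carry two leaves, the others one
        for j in range(2 if i in (0, n - 3) else 1):
            add_edge(adj, ("s", i), ("leaf", i, j))
    return adj


def hang(adj, root, depth, ids):
    """Grow a complete binary tree with 2**depth leaves below `root`."""
    if depth > 0:
        for _ in range(2):
            child = next(ids)
            add_edge(adj, root, child)
            hang(adj, child, depth - 1, ids)


def balanced(n):
    """Balanced tree with n = 2**i (i >= 2) or n = 3 * 2**i (i >= 1) leaves."""
    adj, ids = {}, count()
    if n % 3 == 0:                       # three equal subtrees around a centre
        centre, roots = next(ids), [next(ids) for _ in range(3)]
        for r in roots:
            add_edge(adj, centre, r)
        depth = (n // 3).bit_length() - 1
    else:                                # two equal subtrees joined by an edge
        roots = [next(ids), next(ids)]
        add_edge(adj, roots[0], roots[1])
        depth = (n // 2).bit_length() - 1
    for r in roots:
        hang(adj, r, depth, ids)
    return adj


def paths_of_length(adj, k):
    """Number of unordered vertex pairs joined by a path of exactly k edges."""
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                     # breadth-first search from src
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(1 for d in dist.values() if d == k)
    return total // 2                    # every pair was seen from both ends


def f(n, k):
    """Closed form for balanced trees, as read from the proof of Lemma 6.8 (ii)."""
    if k % 2 == 0:
        return 3 * 2 ** (k // 2 - 1) * (n - 2 ** (k // 2))
    return 2 ** ((k + 1) // 2) * (n - 3 * 2 ** ((k - 3) // 2))


for n in range(4, 12):
    assert all(paths_of_length(caterpillar(n), k) == 4 * (n - k) for k in range(3, n))
for i, n in [(2, 4), (3, 8), (4, 16)]:
    assert all(paths_of_length(balanced(n), k) == f(n, k) for k in range(3, 2 * i))
for i, n in [(1, 6), (2, 12), (3, 24)]:
    assert all(paths_of_length(balanced(n), k) == f(n, k) for k in range(3, 2 * i + 3))
```

Under these assumptions the brute-force counts agree with $4(n - k)$ and $f(k)$ on every tree tried; this is a sanity check on small cases, not a substitute for the proof.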

Now that we know the number of paths of any given length in a caterpillar or balanced tree,
we can determine the size of $P(T)$. We are now ready to prove Theorem 6.1.

Proof of Theorem 6.1. Suppose that $T$ has a path $P$ of length $k$, $k \geq 13$. Fix the two
pendant edges of $P$ as $j_2$ and $j_1$ so that $j_2$ is the first edge in $P$, and $j_1$ is the $k$-th edge in $P$. All
pairs of the edges $j_2$, $c_1$, $c_2$, and $j_1$ must be distance three or more apart and in the order given.
So $d_T(c_1, j_2) \geq 3$ and $d_T(c_1, j_1) \geq 7$. If $c_1$ is the $m$-th edge in $P$ then $5 \leq m \leq k - 8$. Now if $c_2$
is the $j$-th edge in $P$, then $m + 4 \leq j \leq k - 4$, so there are $(k - 4) - (m + 4) + 1 = k - m - 7$
possible choices for the location of $c_2$. Finally, it does not matter at which endpoint of $P$ we
begin counting. So the number of ways of arranging the four edges on this path is

\[ R_k = 2 \sum_{m=5}^{k-8} (k - m - 7) = (k - 11)(k - 12). \]
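The count of arrangements on a single path can be reproduced by direct enumeration. The sketch below assumes, following the bounds $5 \leq m \leq k - 8$ and $m + 4 \leq j \leq k - 4$ in the text, that two edges on the path are at distance at least three exactly when their positions differ by at least four; the function name `arrangements_on_path` is hypothetical.

```python
def arrangements_on_path(k, gap=4):
    """Count placements of j2, c1, c2, j1 (in this order) on a path with k edges.

    Edges are numbered 1..k along the path; j2 is edge 1 and j1 is edge k.
    Positions of consecutive chosen edges must differ by at least `gap`
    (three edges strictly in between), which is the convention assumed here.
    """
    one_direction = sum(
        1
        for m in range(1, k + 1)          # position of c1
        for j in range(1, k + 1)          # position of c2
        if m - 1 >= gap and j - m >= gap and k - j >= gap
    )
    return 2 * one_direction              # the path can be read from either end


for k in range(13, 60):
    assert arrangements_on_path(k) == (k - 11) * (k - 12)
```

For $k = 13$ there is a single admissible placement in each direction ($c_1$ at position 5, $c_2$ at position 9), which matches $(k - 11)(k - 12) = 2$.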


(i) By Lemma 6.8, $T$ has $4(n - k)$ paths of length $k$ for $k \geq 3$. Hence for a caterpillar,

\[ P(T) = \sum_{k=13}^{n-1} 4(n - k)(k - 11)(k - 12) \tag{6.1} \]

\[ = \tfrac{1}{3} n^4 + O(n^3). \tag{6.2} \]

We know by Lemma 6.6 and Lemma 6.7 that if we count the number of ways to choose
the edges $(c_1, c_2, j_1, j_2) \in \mathbb{S}(T)$, then in the cases not counted by $P(T)$ we count every
second neighbour twice. For the cases that are counted by $P(T)$ we do not obtain any
duplicate trees. Therefore by Lemma 6.4,

\[ |N^2_{SPR}(T)| = \tfrac{1}{2} \left( \tfrac{2}{3} n^4 + O(n^3) - P(T) \right) + P(T) \]

\[ = \tfrac{1}{3} n^4 + \tfrac{1}{2} P(T) + O(n^3) = \tfrac{1}{2} n^4 + O(n^3). \]






(ii) Similarly for a balanced tree $T$ with $n = 3(2^i)$ leaves ($i \geq 1$), we can sum over even and
odd path lengths (see Lemma 6.8) to obtain

\[ P(T) = \sum_{k=13}^{n-1} P_k(T)(k - 11)(k - 12) \]

\[ = \sum_{m=7}^{\log_2(n/3) + 1} 3(2^{m-1})(n - 2^m)(2m - 11)(2m - 12) + \sum_{m=7}^{\log_2(n/3) + 1} 2^m \left(n - 3(2^{m-2})\right)(2m - 12)(2m - 13) \]

\[ = O(n^2 \ln(n)^2) = O(n^3). \]

If $T$ is a balanced tree with $n = 2^i$ leaves ($i \geq 2$), then instead we have

\[ P(T) = \sum_{m=7}^{\log_2(n/4) + 1} 3(2^{m-1})(n - 2^m)(2m - 11)(2m - 12) + \sum_{m=7}^{\log_2(n/4) + 2} 2^m \left(n - 3(2^{m-2})\right)(2m - 12)(2m - 13) \]

\[ = O(n^2 \ln(n)^2) = O(n^3). \]

Therefore, for any balanced tree $T$,

\[ |N^2_{SPR}(T)| = \tfrac{1}{2} \left( \tfrac{2}{3} n^4 + O(n^3) - P(T) \right) + P(T) = \tfrac{1}{3} n^4 + O(n^3). \]

□
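The leading term of the caterpillar sum in Equation (6.1) can be confirmed numerically. The sketch below evaluates the sum exactly and compares it with a closed form, $(n - 13)(n - 12)(n - 11)(n - 10)/3$, which is derived here for illustration and is not quoted from the paper; the function names are hypothetical.

```python
from fractions import Fraction


def caterpillar_P(n):
    """Equation (6.1): sum over k = 13..n-1 of 4(n - k)(k - 11)(k - 12)."""
    return sum(4 * (n - k) * (k - 11) * (k - 12) for k in range(13, n))


def caterpillar_P_closed(n):
    """Closed form of the same sum for n >= 13 (an assumption checked below)."""
    return (n - 13) * (n - 12) * (n - 11) * (n - 10) // 3


for n in range(13, 200):
    assert caterpillar_P(n) == caterpillar_P_closed(n)

# The ratio P(T) / n^4 approaches 1/3 as n grows, in line with Equation (6.2).
n = 10 ** 6
ratio = Fraction(caterpillar_P_closed(n), n ** 4)
assert abs(ratio - Fraction(1, 3)) < Fraction(1, 1000)
```

The closed form is a product of four consecutive integers divided by three, so it is always an integer, and its leading coefficient $1/3$ is the one appearing in Equation (6.2).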

This shows that the size of the second SPR neighbourhood of a tree cannot be uniquely
determined by the number of leaves of the tree. We now prove Theorem 6.2, which shows that
the number of leaves and cherries is also insufficient.

Proof of Theorem 6.2. Suppose that $n = 3m$ and $c = 3$, where $m > 7$. Consider the
tree $T_1$ of Type I, with $n$ leaves and $c$ cherries (see Fig. 6.2). For any pair of vertices $x, y$, let
$C_{xy}$ be the caterpillar formed by the path between vertices $x$ and $y$ in $T_1$ and all of the edges
incident to vertices on that path. Let $a$, $b$ and $d$ be the roots of the three cherries of $T_1$, such that
$d_{T_1}(a, b) = 2$. Let $c$ be the vertex in $T_1$ that is not adjacent to a leaf. Both of the caterpillars $C_{ad}$
and $C_{bd}$ have $n - 1$ leaves. If we find $P(C_{ad})$ and $P(C_{bd})$, then we will have found every way of
choosing the edges $c_1$, $c_2$, $j_1$ and $j_2$ so that all four edges are on a path in the order $j_2, c_1, c_2, j_1$.
Eliminating double counting, we have

\[ P(T_1) = P(C_{ad}) + P(C_{bd}) - P(C_{cd}) = 2P(C_{ad}) - P(C_{cd}). \]

We do not consider the caterpillar $C_{ab}$ because it is too short to have any paths of length 13 or
more. So by Equation 6.2,

\[ P(T_1) = \tfrac{2}{3}(n - 1)^4 - \tfrac{1}{3}(n - 2)^4 + O(n^3) = \tfrac{1}{3} n^4 + O(n^3). \]

Now let $T_2$ be the tree of Type II with $n$ leaves, $c$ cherries and maximum path length $2m$
(see Fig. 6.3). Let $a$, $b$ and $d$ be the roots of the three cherries of $T_2$, and let $c$ be the vertex in
$T_2$ that is not adjacent to a leaf. By the same process as above,

\[ P(T_2) = P(C_{ad}) + P(C_{bd}) + P(C_{ab}) - P(C_{ac}) - P(C_{bc}) - P(C_{cd}) = 3P(C_{ad}) - 3P(C_{ac}). \]

Now $C_{ad}$ has $2m + 1$ leaves and $C_{ac}$ has $m + 2$ leaves, so

\[ P(T_2) = (2m + 1)^4 - (m + 2)^4 + O(n^3) \]

\[ = \left(\tfrac{2n}{3} + 1\right)^4 - \left(\tfrac{n}{3} + 2\right)^4 + O(n^3) \]

\[ = \tfrac{5}{27} n^4 + O(n^3). \]

Therefore $|N^2_{SPR}(T_1)| = \tfrac{1}{2} n^4 + O(n^3)$ and $|N^2_{SPR}(T_2)| = \tfrac{23}{54} n^4 + O(n^3)$.

□
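The inclusion-exclusion step in this proof can be illustrated numerically. The sketch below evaluates $P(T_1)$ and $P(T_2)$ from the caterpillar sum of Equation (6.1) and then combines them with the leading term of $|S(T)|$; that leading term is taken to be $\tfrac{2}{3} n^4$, which is an assumption based on our reading of Lemma 6.4 (dividing $(2n)^4$ by the $4!$ orderings of the four edges). All function names are hypothetical.

```python
from fractions import Fraction


def caterpillar_P(n):
    """Equation (6.1) for a caterpillar with n leaves."""
    return sum(4 * (n - k) * (k - 11) * (k - 12) for k in range(13, n))


def P_type_I(n):
    # P(T1) = 2 P(C_ad) - P(C_cd); the caterpillars have n - 1 and n - 2 leaves
    return 2 * caterpillar_P(n - 1) - caterpillar_P(n - 2)


def P_type_II(n):
    m = n // 3
    # P(T2) = 3 P(C_ad) - 3 P(C_ac); the caterpillars have 2m + 1 and m + 2 leaves
    return 3 * caterpillar_P(2 * m + 1) - 3 * caterpillar_P(m + 2)


def second_neighbourhood_coefficient(P, n):
    """Leading coefficient of (1/2)(|S(T)| - P) + P, assuming |S(T)| ~ (2/3) n^4."""
    return Fraction(1, 3) + Fraction(P, 2 * n ** 4)


n = 30000  # n = 3m with m = 10000
coeff_I = second_neighbourhood_coefficient(P_type_I(n), n)
coeff_II = second_neighbourhood_coefficient(P_type_II(n), n)
assert abs(coeff_I - Fraction(1, 2)) < Fraction(1, 100)
assert abs(coeff_II - Fraction(23, 54)) < Fraction(1, 100)
```

Under the stated assumption the two coefficients approach $1/2$ and $23/54$, so the two tree types differ in the $n^4$ term even though they share the same number of leaves and cherries.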

Since $T_1$ and $T_2$ have the same number of leaves and cherries, it is clear that other properties
of the tree $T$ would be required to get an exact formula for the highest order term of $|N^2_{SPR}(T)|$.

7 Concluding Comments In this paper, we derived new results for the sizes of the
first and second RF neighbourhoods of an unrooted binary tree, and we extended the result of
Robinson [22] for the third NNI neighbourhood of an unrooted binary tree (see Appendix A). In
addition, we calculated new asymptotic results for the sizes of the $k$-th RF and NNI neighbourhoods
of a binary phylogenetic tree. We also found an upper bound on the proportion of binary
trees that share at least $k$ non-trivial splits with a given tree on the same leaf set, and found an
expression for the number of pairs of binary trees that share a first neighbour under the RF and
NNI metrics.

In our results for the size of the RF and NNI neighbourhoods of an unrooted binary tree 
T (Theorems 3.1 and 4.1), the term of order contains a parameter dependent on T and k. 
We have calculated bounds on the value of this parameter: for RF, _ sfc +7fc ^ CT,k < 4fc^ — 7fc; 
for NNI, < 3fc(fc — 2). These bounds are not strict, so it would be interesting 

to investigate ways of improving them. A natural question is whether or not both positive and 
negative values of C't,^ and Pr.fc are possible for any given value of k, and if so, whether we can 
find examples of such trees. 

We showed that in contrast to RF and NNI, the size of the second SPR neighbourhood is 
not solely dependent on the number of leaves and cherries of the tree. Humphries and Wu [17] 
showed that for TBR even the first neighbourhood depends on variables other than the number 
of leaves and cherries. 

Throughout this paper, we have considered neighbourhoods of unrooted binary trees under
the three metrics: RF, NNI, and SPR. There are, however, many other metrics that can be used
to compare trees, and which would be interesting to investigate. For example, Humphries and
Wu [17] found an expression for the size of the first TBR neighbourhood of a tree, that depends
on variables other than the number of leaves and cherries. Moulton and Wu [21] recently defined
a new metric dp^ which is similar to the TBR metric. (The same metric was also independently
defined by Kelk and Fischer [18].) Using the result of Humphries and Wu [17], Moulton and 
Wu [21] calculated the size of the first neighbourhood of an unrooted binary tree under this 
metric. 

Given the difficulty of calculating the size of the second SPR neighbourhood, it is possible 
that similar problems would arise in calculating the size of the second neighbourhood under TBR 
or dp. However, this would be interesting to investigate, and it may be possible to find the size 
of the second TBR or dp neighbourhood of a particular type of tree, such as a caterpillar or a 
balanced tree.

Appendices


A Third NNI Neighbourhood Theorem A.1. Let $T \in UB(n, c)$ (n > p). Then

\[ |N^3_{NNI}(T)| = \tfrac{4}{3} n^3 - 8n^2 - \tfrac{70}{3} n + 8cn - 46c + 12p_3(T) + 164. \]


Proof. Let $x$ be the number of ways of choosing three distinct internal edges so that no
pair is adjacent, let $y$ be the number of ways of choosing these edges so that exactly one pair is
adjacent, and let $z$ be the number of ways of choosing these edges so that all pairs are adjacent.
Let $t$ be the number of ways of choosing two adjacent edges of $T$. Robinson [22] showed that

\[ |N^3_{NNI}(T)| = 8x + 16y + 24z + 36p_3(T) + 2t, \tag{A.1} \]

where

\[ x + y + z + p_3(T) = \frac{(n - 3)(n - 4)(n - 5)}{6}, \tag{A.2} \]


t = n + c — 6< 


3(n — 4) 


PsT) < 2n — 12 for n > 7 


^ 

z = c — 2 < - for n > 4 , 


and for n > 7, if n is odd, then y < — 16n + 42 and if n is even, then y < + 45. 


There are $(n - 5)t$ ways of choosing three distinct internal edges such that at least one pair
is adjacent, so we have

\[ y = (n - 5)(n + c - 6) - 2p_3(T) - 3z, \]

where $2p_3(T)$ is the number of cases where the three edges form a path of length three, and $3z$
is the number of cases where all three edges share an endpoint.

It follows from Equation (A.2) that

\[ x = \frac{(n - 3)(n - 4)(n - 5)}{6} - y - z - p_3(T). \]

Hence by Equation (A.1)

\[ |N^3_{NNI}(T)| = 8x + 16y + 24z + 36p_3(T) + 2t \]

\[ = \tfrac{4}{3} n^3 - 8n^2 - \tfrac{70}{3} n + 8cn - 46c + 12p_3(T) + 164. \]

□
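The algebra in this proof is easy to check mechanically. The sketch below substitutes the relations for $t$, $z$, $y$ and $x$ into Equation (A.1) and compares the result with the expanded polynomial; the coefficients $-\tfrac{70}{3} n$ and $-46c$ are obtained by carrying out that substitution here, and the function names are hypothetical. The inputs need not correspond to actual trees, since the identity is purely algebraic.

```python
from fractions import Fraction


def third_nni_from_counts(n, c, p3):
    """Evaluate Equation (A.1) using the relations stated in the proof."""
    t = n + c - 6                                   # pairs of adjacent internal edges
    z = c - 2                                       # triples sharing an endpoint
    y = (n - 5) * t - 2 * p3 - 3 * z                # triples with exactly one adjacent pair
    x = Fraction((n - 3) * (n - 4) * (n - 5), 6) - y - z - p3   # Equation (A.2)
    return 8 * x + 16 * y + 24 * z + 36 * p3 + 2 * t


def third_nni_closed(n, c, p3):
    """Expanded form of the same expression."""
    return (Fraction(4, 3) * n ** 3 - 8 * n ** 2 - Fraction(70, 3) * n
            + 8 * c * n - 46 * c + 12 * p3 + 164)


for n in range(5, 40):
    for c in range(2, n // 2 + 1):
        for p3 in range(0, 2 * n):
            assert third_nni_from_counts(n, c, p3) == third_nni_closed(n, c, p3)
```

Because both sides are polynomials in $n$, $c$ and $p_3(T)$, agreement on this grid of values confirms the expansion term by term.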

Theorem A.l tells us the size of the third NNl neighbourhood in terms of the number of 
leaves, cherries and internal paths of length three. We now consider how to determine the num¬ 
ber of internal paths of length three. 

Theorem A. 2. Let T G UB{n, c) (n > A). Then Pi{T) = n — 3, P 2 {T) = n -|- c — 6, and for 
fc > 3, 


Pk(T) = Apk-2{T) - hk{T) - mfe(r), 

where for all values of k, mk{T) is the number of paths of length k in T where both endpoints 
are leaves ofT, and hk(T) is the number of paths of length k in T where exactly one endpoint is 
a leaf of T. 

Proof. The number of internal edges in $T$ is $n - 3$, so $p_1(T) = n - 3$. The number of pairs
of adjacent internal edges is $n + c - 6$, so $p_2(T) = n + c - 6$. Every path of length $k$ in $T$ has
neither, exactly one, or both of its endpoints at a leaf, so the total number of paths of length $k$
in $T$ is $p_k(T) + m_k(T) + h_k(T)$. By Lemma 2.1, this total equals $4p_{k-2}(T)$. Therefore
\[
p_k(T) = 4p_{k-2}(T) - h_k(T) - m_k(T).
\]

□

It follows that $p_3(T) = 4(n - 3) - h_3(T) - m_3(T)$ for a tree $T \in UB(n, c)$, and therefore
\[
|N^3_{NNI}(T)| = \frac{4}{3}n^3 - 8n^2 + \frac{74}{3}n + 8cn - 46c - 12h_3(T) - 12m_3(T) + 20.
\]

Note that $m_k(T)$ and $h_k(T)$ can both be counted using a breadth-first search in polynomial
time.
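
One possible way to carry out this count is sketched below. It is an illustration under stated
assumptions rather than the authors' implementation: the tree is assumed to be given as an adjacency
dictionary, $k \geq 1$, and the names `path_counts`, `caterpillar` and `third_nni_size` are
hypothetical. A breadth-first search is run from every vertex; since a tree has exactly one path
between any two vertices, each pair of vertices at distance $k$ is one path of length $k$, classified
by how many of its endpoints are leaves.

```python
from collections import deque
from fractions import Fraction


def path_counts(adj, k):
    """Return (p_k, h_k, m_k) for a tree given as an adjacency dict, k >= 1."""
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    counts = [0, 0, 0]  # indexed by the number of endpoints that are leaves
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == k:      # no need to search beyond distance k
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for v, d in dist.items():
            if d == k:
                counts[(source in leaves) + (v in leaves)] += 1
    # Every path was found once from each of its two endpoints.
    p_k, h_k, m_k = (value // 2 for value in counts)
    return p_k, h_k, m_k


def caterpillar(n):
    """Adjacency dict of the caterpillar tree on n >= 4 leaves (illustrative helper)."""
    spine = [f"v{i}" for i in range(n - 2)]
    adj = {v: [] for v in spine}

    def link(a, b):
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    for a, b in zip(spine, spine[1:]):
        link(a, b)
    leaf = 0
    for i, v in enumerate(spine):
        # The two end vertices of the spine carry a cherry, the others one leaf.
        for _ in range(2 if i in (0, n - 3) else 1):
            link(v, f"x{leaf}")
            leaf += 1
    return adj


def third_nni_size(n, c, h3, m3):
    """Evaluate the expression for |N^3_NNI(T)| in terms of n, c, h_3 and m_3."""
    return (Fraction(4, 3) * n**3 - 8 * n**2 + Fraction(74, 3) * n
            + 8 * c * n - 46 * c - 12 * h3 - 12 * m3 + 20)
```

For example, `path_counts(caterpillar(7), 3)` gives the triple $(p_3, h_3, m_3)$ for the caterpillar
on seven leaves, and the last two entries can be passed to `third_nni_size(7, 2, h3, m3)`. Each
search visits every vertex at most once, so the whole count takes time quadratic in the number of
vertices.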

B Proof of Lemma 4.6 First, suppose that edges $e_1$ and $e_2$ are non-adjacent in $T$. Let
$A$ be the subtree containing $e_2$ such that $d_T(A, e_1) = 1$. Let the other three subtrees at distance
one from $e_1$ be $B$, $C$ and $D$. First we consider $NNI(T; e_1, e_2)$. The first operation swaps two
of the subtrees incident to $e_1$ to obtain $T_1 \in NNI(T; e_1)$. We then perform an NNI operation
on edge $e_2$ in $T_1$. We obtain a tree $T_2$ with a subtree $A'$ such that $d_{T_2}(A', e_1) = 1$, and $B$, $C$
and $D$ are the other three subtrees at distance one from $e_1$. Now consider $NNI(T; e_2, e_1)$. First,
we perform an NNI operation on edge $e_2$ in $A$ (in $T$), and one of the two distinct trees produced
is $T_1' \in NNI(T; e_2)$ with subtree $A'$ where $d_{T_1'}(A', e_1) = 1$, and $B$, $C$ and $D$ are the other three
subtrees at distance one from $e_1$. The second operation swaps two of the subtrees at distance one
from $e_1$ in $T_1'$, which are $A'$, $B$, $C$ and $D$. One of the two distinct trees obtained is $T_2$, and so
$T_2 \in NNI(T; e_2, e_1)$. This is true for all $T_2 \in NNI(T; e_1, e_2)$, so $P \subseteq Q$. Similarly, $Q \subseteq P$ and
so $P = Q$.

Now suppose that internal edges $e_1$ and $e_2$ are adjacent. Let $A$ and $B$ be subtrees such
that $d_T(A, e_1) = d_T(B, e_1) = 1$ and $d_T(A, e_2) = d_T(B, e_2) = 2$. Let $C$ and $D$ be subtrees such
that $d_T(C, e_2) = d_T(D, e_2) = 1$ and $d_T(C, e_1) = d_T(D, e_1) = 2$. Let $E$ be the subtree such that
$d_T(E, e_1) = d_T(E, e_2) = 1$. This can be seen in Fig. B.1.


Fig. B.1. A general structure for an unrooted binary tree T, showing subtrees A, B, C, D, and E.

First, we consider $NNI(T; e_1, e_2)$. Let $T_1 \in NNI(T; e_1)$ and $T_2 \in NNI(T_1; e_2)$. The
first operation is on $e_1$, so either $d_{T_1}(A, E) = 2$ or $d_{T_1}(B, E) = 2$. Without loss of generality,
suppose $d_{T_1}(A, E) = 2$. Then $d_{T_1}(E, e_2) = d_{T_1}(A, e_2) = 2$. Therefore after the second operation,
$d_{T_2}(E, A) = 2$. We now consider $NNI(T; e_2, e_1)$. Let $T_1' \in NNI(T; e_2)$ and $T_2' \in NNI(T_1'; e_1)$.
The first NNI operation is on $e_2$, so $d_{T_1'}(A, E) = 4$. Therefore $d_{T_2'}(A, E) \geq 3$. Hence $T_2 \neq T_2'$.
The choices of $T_2 \in P$ and $T_2' \in Q$ were arbitrary, so $P \cap Q = \emptyset$.

□



